An Analysis of Human Life Expectancy
Data Science Pipeline Tutorial
Austin Miller
December 2024
Purpose
This project, developed as a tutorial for the data science pipeline, was guided by the University of Maryland's Master's course, Principles of Data Science, during the Fall 2024 semester. It covers essential topics in data science and analytics, including data curation and cleaning, exploratory data analysis, analytical prediction, and stakeholder reporting on discovered insights. Focusing on human life expectancy, the tutorial adopts a real-world analytical approach to uncover the factors influencing life expectancy. Rather than being a polished product for professional presentation, this tutorial serves as a step-by-step, practical guide, offering an educational example of how to apply the data science pipeline in a real-world scenario.
Goal
The goal of this tutorial is to scientifically explore the trends influencing human life expectancy through data analysis and statistical machine learning models. By taking an exploratory approach, we aim to uncover actionable insights and inform future predictions of life expectancy, free from preconceived biases. Along the way, this tutorial introduces the methodology behind the data science pipeline, providing a comprehensive foundation for analytical exploration.
Table of Contents
- Data Curation
- Data Cleaning
- Exploratory Data Analysis
- Model Engineering
- Predicting the Future of Human Life Expectancy
- Report Summary Phase of the Data Science Pipeline
In [1]:
# Library Imports (More packages and libraries are added where appropriate)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import os
# This option makes Pandas display all rows of a DataFrame (no truncation)
# Extremely useful when exploring your data
pd.set_option('display.max_rows', None)
For our analysis on Human Life Expectancy, we are going to use data from a reputable source, The United States Census Bureau. Choosing reputable and trusted data is fundamental for accurate analysis, as a reputable source is more likely to adhere to high standards of data integrity. Harvard Business School's Catherine Cote provides the following definition for data integrity:
“Data integrity is the accuracy, completeness, and quality of data as it’s maintained over time and across formats. Preserving the integrity of your company’s data is a constant process.”
The United States Census Bureau should provide us with data that adheres to good data integrity practices. The data provided by the US Census Bureau is an international dataset covering 227 countries world-wide. The dataset offers the ability to select a year range starting with 1950, but warns us that values before 2000 cannot be trusted. Since data integrity is fundamental to data analysis, we are going to heed that warning and only select years in the range of 2000-2024. The dataset also allows us to add custom additional columns before downloading the data. The features selected for this tutorial can be found in the table below. To access the US Census Bureau data, follow this link:
Features and Descriptions
| Feature | Description |
|---|---|
| Name | Country Name |
| GENC | Geopolitical Entities, Names, and Codes (GENC) |
| Year | The reference year for the data. |
| Population | All people, male and female, child and adult, living in a given geographic area. |
| Annual Growth Rate | The average annual percent change in the population, resulting from a surplus (or deficit) of births over deaths and the balance of migrants entering and leaving a country. |
| Rate of Natural Increase | The difference between the crude birth rate and the crude death rate, usually expressed as a percent rather than per 1,000 population. |
| Population Density | Total population within a geographic entity divided by the land area of that entity, expressed as "people per square kilometer" |
| Total Fertility Rate | The average number of children that would be born per woman if all women lived to the end of their childbearing years and bore children according to age-specific fertility rates. |
| Crude Birth Rate | The average annual number of births during a year per 1,000 population at midyear. |
| Life Expectancy at Birth, Both Sexes | The average number of years a group of people born in the same year can be expected to live if mortality at each age remains constant in the future. |
| Infant Mortality Rate, Both Sexes | The number of deaths of infants under 1 year of age from a cohort of 1,000 live births, denoted as IMR. |
| Crude Death Rate | The average annual number of deaths during a year per 1,000 population at midyear. |
| Net Migration Rate | The difference between the number of migrants entering and those leaving a country in a year, per 1,000 midyear population. |
| Births, Both Sexes | Total number of live births. |
| Deaths, Both Sexes | Total number of deaths. |
After downloading the data, use Pandas to read the CSV into a Pandas DataFrame.
In [2]:
# Set the file path as a variable (may need the full file path depending on your working directory)
file = 'IDB.csv'
# Use Pandas to load in the data
data = pd.read_csv(file)
# Create a copy of the data and leave the original data untouched.
# Practical approach to reference the unaltered data later.
df = data.copy()
Data cleaning is a critical step that can determine the success or failure of your data analysis and shape the foundation of your data science pipeline. It ensures that the data is complete, reliable, and prepared for meaningful analysis. Without properly cleaned data, your analysis will be invalid and your pipeline will most likely fail. Proper data cleaning is one of the most important steps in the entire data science pipeline, if not the most important.
The following code chunks will demonstrate methods to properly explore and clean the US Census Bureau data we just downloaded.
In [3]:
# Check the first 5 rows of the dataframe
df.head(5)
# Note: If you do not add custom columns to the data, you will have to promote the first row to headers
# You can follow Zach Bobbitt's tutorial here: https://www.statology.org/pandas-set-first-row-as-header/
Out[3]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -> 2000 | NaN | NaN | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1 | Afghanistan | AF | 2000.0 | 23,603,559 | -0.79 | 3.52 | 36.2 | 7.4 | 46.7 | 56.4 | 87.6 | 11.5 | -43.1 | 1,102,185 | 270,980 |
| 2 | Albania | AL | 2000.0 | 3,158,350 | -1.02 | 1.1 | 115.3 | 2.17 | 16.4 | 74.7 | 22.6 | 5.4 | -21.2 | 51,877 | 17,028 |
| 3 | Algeria | DZ | 2000.0 | 30,634,947 | 1.4 | 1.46 | 12.9 | 2.37 | 19.2 | 71.8 | 38.5 | 4.6 | -0.6 | 588,628 | 141,637 |
| 4 | American Samoa | AS | 2000.0 | 57,699 | 0.09 | 2.65 | 291.4 | 4 | 30 | 74.9 | 10.6 | 3.5 | -25.6 | 1,730 | 202 |
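As the comment in the previous cell notes, a download without custom columns may carry its headers in row 0. A minimal sketch of promoting that row to headers (the small frame here is hypothetical, standing in for such a download):

```python
import pandas as pd

# Hypothetical frame whose first row holds the real column names
raw = pd.DataFrame([['Name', 'Year'], ['Albania', '2000'], ['Algeria', '2000']])

# Promote row 0 to the header, then drop it and reset the index
raw.columns = raw.iloc[0]
raw = raw.iloc[1:].reset_index(drop=True)

print(list(raw.columns))  # ['Name', 'Year']
```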
When checking the first 5 rows of the data, we can instantly see that something is off. The very first row is an identifier/separator for the year, which we do not need because Year is already a feature of the dataset. Additionally, it appears the dataset uses "--" instead of NaN or NA, at least in the year separators. The first step in our cleaning process is to remove these year separators and replace every occurrence of "--" with NaN.
In [4]:
# Years are separated into their own rows, and the dataset contains '--' where NA's should be
# Replacing the '--' with NA's
df.replace('--', np.nan, inplace=True)
# Extracting the dataset's year separators
df[df.Year.isnull()].head(3)
Out[4]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -> 2000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 228 | -> 2001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 456 | -> 2002 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
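As an aside, the '--' sentinel can also be handled at load time via read_csv's na_values parameter, skipping the post-hoc replace entirely; a sketch using an inline sample in place of the real IDB.csv:

```python
import io
import pandas as pd

# Small inline sample standing in for the real IDB.csv
csv_text = "Name,Year,Population\nAfghanistan,2000,--\nAlbania,2000,3158350\n"

# na_values tells the parser to treat '--' as missing on read
df_sample = pd.read_csv(io.StringIO(csv_text), na_values='--')

print(df_sample['Population'].isnull().sum())  # 1
```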
In [5]:
# Dropping all of the year separators in-place
df.drop(df[df.Year.isnull()].index,inplace=True)
# Checking to see if any Year is Null
# Since our instance of the dataframe returned no results, we are good to move on
df[df.Year.isnull()]
Out[5]:
| Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The previous code chunks removed the year separators and replaced all occurrences of "--" with NaN. Then we verified that there were no rows where Year was Null. Now let's take a look at the DataFrame and identify other cleaning requirements.
In [6]:
# Printing the number of countries present
print(f'Country Count: {df["Name"].nunique()}')
# Showing the first 5 Rows
df.head()
Country Count: 227
Out[6]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Afghanistan | AF | 2000.0 | 23,603,559 | -0.79 | 3.52 | 36.2 | 7.4 | 46.7 | 56.4 | 87.6 | 11.5 | -43.1 | 1,102,185 | 270,980 |
| 2 | Albania | AL | 2000.0 | 3,158,350 | -1.02 | 1.1 | 115.3 | 2.17 | 16.4 | 74.7 | 22.6 | 5.4 | -21.2 | 51,877 | 17,028 |
| 3 | Algeria | DZ | 2000.0 | 30,634,947 | 1.4 | 1.46 | 12.9 | 2.37 | 19.2 | 71.8 | 38.5 | 4.6 | -0.6 | 588,628 | 141,637 |
| 4 | American Samoa | AS | 2000.0 | 57,699 | 0.09 | 2.65 | 291.4 | 4 | 30 | 74.9 | 10.6 | 3.5 | -25.6 | 1,730 | 202 |
| 5 | Andorra | AD | 2000.0 | 65,099 | -0.12 | 0.61 | 139.1 | 1.37 | 11.5 | 81 | 4.9 | 5.4 | -7.3 | 747 | 349 |
In [7]:
# Check data types and get null counts
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5675 entries, 1 to 5699
Data columns (total 15 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Name                                  5675 non-null   object
 1   GENC                                  5650 non-null   object
 2   Year                                  5675 non-null   float64
 3   Population                            5675 non-null   object
 4   Annual Growth Rate                    5625 non-null   object
 5   Rate of Natural Increase              5625 non-null   object
 6   Population Density                    5675 non-null   object
 7   Total Fertility Rate                  5613 non-null   object
 8   Crude Birth Rate                      5625 non-null   object
 9   Life Expectancy at Birth, Both Sexes  5613 non-null   object
 10  Infant Mortality Rate, Both Sexes     5613 non-null   object
 11  Crude Death Rate                      5625 non-null   object
 12  Net Migration Rate                    5625 non-null   object
 13  Births, Both Sexes                    5625 non-null   object
 14  Deaths, Both Sexes                    5625 non-null   object
dtypes: float64(1), object(14)
memory usage: 709.4+ KB
Running df.info() reveals a couple of issues. The first is that most of our data types are objects rather than numeric types. Name and GENC, being categorical, are appropriate as object types, but the other features should be floats. The second is that the Non-Null Counts show concerning signs of uncleaned data. The index output, "Index: 5675 entries", tells us that each feature's Non-Null Count should match the row count of 5675, but that is clearly not the case; the only complete columns are Name, Year, Population, and Population Density. These two findings are major issues that we must address, so let's start with the data types.
To correct the data types, we first must remove the commas from the strings that are present throughout the DataFrame. Without this step, the features won't properly convert to a numeric type.
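In isolation, the comma-stripping plus coercion step looks like this (a small self-contained sketch; errors='coerce' turns anything unparseable into NaN):

```python
import pandas as pd

# Comma-grouped figures arrive from the CSV as strings
s = pd.Series(['23,603,559', '3,158,350', '--'])

# Strip the thousands separators, then coerce; unparseable values become NaN
nums = pd.to_numeric(s.str.replace(',', '', regex=False), errors='coerce')

print(nums.tolist())  # [23603559.0, 3158350.0, nan]
```

Note that pd.read_csv also accepts a thousands=',' parameter, which would handle the separators at load time instead.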
In [8]:
# Defining the columns we want to convert to numeric
numeric_cols = list(df.columns)[2:]
# Iterating over the columns: stripping the thousands commas, then coercing to numeric
# (errors='coerce' turns any remaining non-numeric strings into NaN)
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '', regex=False), errors='coerce')
# Making a copy of the data with the corrected data types.
# Same as preserving a copy of the original data we did previously.
# Practical approach to reference the unaltered data later.
df_all_countries = df.copy()
# Confirm data types after conversion
print(df.dtypes)
Name                                     object
GENC                                     object
Year                                    float64
Population                                int64
Annual Growth Rate                      float64
Rate of Natural Increase                float64
Population Density                      float64
Total Fertility Rate                    float64
Crude Birth Rate                        float64
Life Expectancy at Birth, Both Sexes    float64
Infant Mortality Rate, Both Sexes       float64
Crude Death Rate                        float64
Net Migration Rate                      float64
Births, Both Sexes                      float64
Deaths, Both Sexes                      float64
dtype: object
In [9]:
df.head()
Out[9]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Afghanistan | AF | 2000.0 | 23603559 | -0.79 | 3.52 | 36.2 | 7.40 | 46.7 | 56.4 | 87.6 | 11.5 | -43.1 | 1102185.0 | 270980.0 |
| 2 | Albania | AL | 2000.0 | 3158350 | -1.02 | 1.10 | 115.3 | 2.17 | 16.4 | 74.7 | 22.6 | 5.4 | -21.2 | 51877.0 | 17028.0 |
| 3 | Algeria | DZ | 2000.0 | 30634947 | 1.40 | 1.46 | 12.9 | 2.37 | 19.2 | 71.8 | 38.5 | 4.6 | -0.6 | 588628.0 | 141637.0 |
| 4 | American Samoa | AS | 2000.0 | 57699 | 0.09 | 2.65 | 291.4 | 4.00 | 30.0 | 74.9 | 10.6 | 3.5 | -25.6 | 1730.0 | 202.0 |
| 5 | Andorra | AD | 2000.0 | 65099 | -0.12 | 0.61 | 139.1 | 1.37 | 11.5 | 81.0 | 4.9 | 5.4 | -7.3 | 747.0 | 349.0 |
That is much better! All of our columns have been converted to their proper data types. Now let's address the second issue: missing data.
First, let's check and make sure all countries have an entry for every year in 2000-2024.
In [10]:
# Country value counts [where a country's row count is less than the number of years (Max Year - Min Year + 1)]
# Missing Country data
missing_countries = df['Name'].value_counts()[df['Name'].value_counts() < (max(df.Year) - min(df.Year) + 1)]
missing_countries
missing_countries
Out[10]:
Series([], Name: count, dtype: int64)
As indicated by the empty response in the previous code, all of our countries have entries for every year. Let's find out what exactly is missing then.
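An equivalent check counts distinct years per country directly; a sketch on a toy frame (value_counts above compares raw row counts, nunique below compares distinct years):

```python
import pandas as pd

# Toy frame: country A covers both years, country B is missing 2001
toy = pd.DataFrame({'Name': ['A', 'A', 'B'], 'Year': [2000, 2001, 2000]})

# Every country should have one row per distinct year in the dataset
expected_years = toy['Year'].nunique()

# Countries whose distinct-year count falls short are missing years
incomplete = toy.groupby('Name')['Year'].nunique()
incomplete = incomplete[incomplete < expected_years]

print(incomplete.to_dict())  # {'B': 1}
```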
In [11]:
# Importing textwrap for wrapping axis labels (part of the Python standard library, no install needed)
import textwrap
# Creating the wrapped labels for visual appeal and space saving
wrapped_labels = [textwrap.fill(column, width=15,break_long_words=False) for column in df.columns]
# Setting the plot size
plt.figure(figsize=(13,5))
# Creating the bar plot of missing values
# Summing the null values for each feature and dividing by the number of rows (converted to percent)
sns.barplot(x=wrapped_labels,y=df.isnull().sum()/len(df)*100)
# Setting axis labels
plt.xlabel("Column")
plt.ylabel("Percent Missing")
# Rotating the x-axis by 90 degrees.
plt.xticks(rotation=90)
# Set the plot title
plt.title("Percent of Missing Data by Feature")
# Display the plot
plt.show()
This visualization quickly enables us to determine the percentage of data missing for each feature. The majority of the features have between 0.8% and ~1% missing values, while GENC has around 0.4%. GENC is a country identifier, so we wouldn't expect it to have missing data at all. After a quick Google search, it's easy to identify what's wrong: Namibia's GENC (ISO Alpha-2) country code is NA. Pandas reads the string "NA" as a null value, so Namibia's GENC code was wrongly classified as missing. Let's add it back and then move on to the rest of the data.
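The Namibia pitfall can also be avoided at load time: read_csv's keep_default_na and na_values parameters control which strings are parsed as missing. A hedged sketch (the inline CSV is illustrative, not the real file):

```python
import io
import pandas as pd

# Inline sample standing in for the real file; Namibia's GENC code is the string 'NA'
csv_text = "Name,GENC\nNamibia,NA\nAlbania,AL\n"

# By default, pandas parses the literal string 'NA' as a missing value...
default = pd.read_csv(io.StringIO(csv_text))

# ...but keep_default_na=False with an explicit na_values list preserves it
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=['--'])

print(default['GENC'].isnull().sum(), literal['GENC'].isnull().sum())  # 1 0
```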
In [12]:
# Namibia's ISO 2 (GENC) code is NA, this got removed in the data cleaning steps
df[df['GENC'].isnull()]['Name'].value_counts()
Out[12]:
Name
Namibia    25
Name: count, dtype: int64
In [13]:
# Correcting Namibia's GENC
df.loc[df['Name'] == 'Namibia', 'GENC'] = 'NA'
In [14]:
# Making sure the corrections were made
df[df.GENC == 'NA'].head(3)
Out[14]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 141 | Namibia | NA | 2000.0 | 1877964 | 3.24 | 2.20 | 2.3 | 4.09 | 33.6 | 53.4 | 66.5 | 11.6 | 10.4 | 63023.0 | 21727.0 |
| 369 | Namibia | NA | 2001.0 | 1929692 | 2.21 | 2.24 | 2.3 | 4.10 | 34.0 | 53.3 | 65.1 | 11.6 | -0.3 | 65680.0 | 22443.0 |
| 597 | Namibia | NA | 2002.0 | 1964708 | 1.39 | 2.18 | 2.4 | 4.06 | 33.7 | 53.1 | 63.6 | 11.9 | -7.9 | 66170.0 | 23313.0 |
Missing Values: Annual Growth Rate
In [15]:
# Simple check of missing values, annual growth rate.
df[df['Annual Growth Rate'].isnull()]['Name'].value_counts()
Out[15]:
Name
United States    12
Puerto Rico      10
Sudan             8
South Sudan       8
Libya             6
Syria             4
Honduras          1
Sri Lanka         1
Name: count, dtype: int64
We can see that there are 8 countries and territories with missing entries for Annual Growth Rate, and ironically enough, the United States and its territory Puerto Rico are missing the most. Given that we got the data from the official US Census Bureau website, this is really unfortunate. To save time, instead of checking every feature individually, let's look at missing values by country to see if any patterns emerge.
In [16]:
# Gathering the counts of all missing data by country
percent_missing_by_country = df.set_index('Name').isnull().mean(axis=1) * 100
# Removing the countries that have no missing data
percent_missing_by_country = percent_missing_by_country[percent_missing_by_country > 0]
# Aggregating so each country appears only once, dividing the summed percentages by the number of years
percent_missing_by_country_unique = percent_missing_by_country.groupby(level=0).sum() / (max(df.Year) - min(df.Year) + 1)
# Setting the plot size
plt.figure(figsize=(12, 6))
# Sorting the x axis by percent missing
percent_missing_by_country_unique.sort_values(ascending=False).plot(kind='bar')
# Setting the plot title and xlab and ylab
plt.title('Percentage of Missing Data by Country', fontsize=16)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Percent Missing', fontsize=12)
# Rotating the x tick labels
plt.xticks(rotation=90, fontsize=10)
# Showing the plot
plt.show()
Wow, we got lucky: only 8 countries and territories have any missing data at all. The United States has an astounding number of missing values, even though the data comes directly from the United States Census Bureau website. I don't think anyone would have expected that!
We must now determine what to do about the missing data. We have three main options: find the missing values from another source, impute them, or remove the affected countries from our analysis. Unfortunately, the United States has far too much missing data, so we are going to remove it from the analysis. That one is painful, but it must be done. We will also remove every other affected country except the two with the least missing data, Sri Lanka and Honduras, whose gaps we will impute using the K-Nearest Neighbors imputer. In essence, we will use each country's own available data to predict its missing values.
For more information on handling missing data with various techniques, see this article by Jason Brownlee.
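As a lighter-weight alternative to KNN imputation for yearly series like these, pandas' interpolate can fill an interior gap from its neighbors; a minimal sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy yearly series with one gap, standing in for a single country's feature
s = pd.Series([46.0, np.nan, 48.0, 49.0], index=[2000, 2001, 2002, 2003])

# Linear interpolation fills 2001 halfway between its neighbors
filled = s.interpolate(method='linear')

print(filled.loc[2001])  # 47.0
```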
In [17]:
# Define the countries we want to impute
countries_missing = ['Sri Lanka', 'Honduras']
# Check the missing rows for the data
missing_rows = df[df['Name'].isin(countries_missing) & df.isnull().any(axis=1)].sort_values(by='Name')
# Displaying the missing rows
missing_rows
Out[17]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89 | Honduras | HN | 2000.0 | 6367242 | NaN | NaN | 56.9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 192 | Sri Lanka | LK | 2000.0 | 19041167 | NaN | NaN | 294.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
In [18]:
# Importing the KNN imputer to fill in the missing values
from sklearn.impute import KNNImputer
# Gathering the Sri Lanka data (.copy() avoids writing to a view of df)
sri_lanka_data = df[df['Name'] == 'Sri Lanka'].copy()
# Initializing the KNN Imputer with 2 neighbors
knn_imputer = KNNImputer(n_neighbors=2)
# Imputing all missing data in the columns that contain numeric values
sri_lanka_data.iloc[:, 2:] = knn_imputer.fit_transform(sri_lanka_data.iloc[:, 2:])
# Updating the dataset with the new Sri Lanka
df.update(sri_lanka_data)
In [19]:
# Gathering the Honduras data (.copy() avoids writing to a view of df)
honduras_data = df[df['Name'] == 'Honduras'].copy()
# Initializing the KNN Imputer with 2 neighbors
knn_imputer = KNNImputer(n_neighbors=2)
# Imputing all missing data in the columns that contain numeric values
honduras_data.iloc[:, 2:] = knn_imputer.fit_transform(honduras_data.iloc[:, 2:])
# Updating the dataset with the new Honduras
df.update(honduras_data)
In [20]:
df[df['Name'] == 'Sri Lanka'].head(3)
Out[20]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192 | Sri Lanka | LK | 2000.0 | 19041167 | 1.10 | 1.31 | 294.6 | 2.365 | 19.2 | 73.1 | 12.55 | 6.05 | -2.15 | 367929.0 | 115871.0 |
| 420 | Sri Lanka | LK | 2001.0 | 19074889 | 1.01 | 1.28 | 295.1 | 2.350 | 19.1 | 72.6 | 13.10 | 6.20 | -2.80 | 363306.0 | 118406.0 |
| 648 | Sri Lanka | LK | 2002.0 | 19285932 | 1.19 | 1.34 | 298.4 | 2.380 | 19.3 | 73.6 | 12.00 | 5.90 | -1.50 | 372552.0 | 113336.0 |
In [21]:
df[df['Name'] == 'Honduras'].head(3)
Out[21]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89 | Honduras | HN | 2000.0 | 6367242 | 2.46 | 2.705 | 56.9 | 4.045 | 32.95 | 68.3 | 28.35 | 5.9 | -2.45 | 217876.5 | 38920.0 |
| 317 | Honduras | HN | 2001.0 | 6535344 | 2.48 | 2.740 | 58.4 | 4.110 | 33.40 | 68.1 | 28.90 | 6.0 | -2.60 | 218109.0 | 38968.0 |
| 545 | Honduras | HN | 2002.0 | 6698323 | 2.44 | 2.670 | 59.9 | 3.980 | 32.50 | 68.5 | 27.80 | 5.8 | -2.30 | 217644.0 | 38872.0 |
The missing rows for Honduras and Sri Lanka are both filled, and look pretty reasonable. For the other 6 countries with missing data, we are going to remove them from the dataset, which will still give us 221 countries to analyze.
In [22]:
# Creating a list of countries to remove
remove_countries = ['Libya','Puerto Rico', "South Sudan", "Sudan","United States","Syria"]
# Overwriting the data frame with those countries removed
df = df[~df['Name'].isin(remove_countries)]
In [23]:
# Checking to make sure all missing values have been accounted for
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5525 entries, 1 to 5699
Data columns (total 15 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Name                                  5525 non-null   object
 1   GENC                                  5525 non-null   object
 2   Year                                  5525 non-null   float64
 3   Population                            5525 non-null   int64
 4   Annual Growth Rate                    5525 non-null   float64
 5   Rate of Natural Increase              5525 non-null   float64
 6   Population Density                    5525 non-null   float64
 7   Total Fertility Rate                  5525 non-null   float64
 8   Crude Birth Rate                      5525 non-null   float64
 9   Life Expectancy at Birth, Both Sexes  5525 non-null   float64
 10  Infant Mortality Rate, Both Sexes     5525 non-null   float64
 11  Crude Death Rate                      5525 non-null   float64
 12  Net Migration Rate                    5525 non-null   float64
 13  Births, Both Sexes                    5525 non-null   float64
 14  Deaths, Both Sexes                    5525 non-null   float64
dtypes: float64(12), int64(1), object(2)
memory usage: 690.6+ KB
That's it for our missing data! We now have 5,525 rows and no missing values. That covers all of our data cleaning steps, for now. Every dataset is different, and each requires different steps. It's fundamental that you analyze your data closely and clean it appropriately for the dataset you are working with. The next step of the data science pipeline is EDA, Exploratory Data Analysis. This is where we will begin to discover and visualize our data in depth. We may discover more cleaning steps during the EDA phase, but that's part of exploring and learning the data. Let's get started!
Exploratory Data Analysis is one of the most exciting phases of the data science pipeline. It's the beginning of a deeper investigation into your data, enabling you to understand the structure of the data by identifying underlying patterns and drawing meaningful connections between your features and the real world. The EDA phase allows you to visualize and interpret the data for a greater understanding, laying the groundwork for further analysis. During EDA you begin to make sense of your data by turning complexity into clarity through statistics, tables, graphs, and visualizations, making it an essential step in the data science pipeline. For a simple definition of EDA and further reading, check out the definition and link below from IBM.
“Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.”
A good starting point for EDA is performing a descriptive analysis of your data, which can be quickly achieved using Pandas' describe() method.
In [24]:
# Descriptive Analysis of the data
df.describe()
Out[24]:
| | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5525.000000 | 5.525000e+03 | 5525.000000 | 5525.000000 | 5525.000000 | 5525.000000 | 5525.000000 | 5525.000000 | 5525.000000 | 5525.000000 | 5525.000000 | 5.525000e+03 | 5.525000e+03 |
| mean | 2012.000000 | 3.041586e+07 | 1.242454 | 1.266667 | 429.901593 | 2.711613 | 20.466597 | 72.022244 | 25.238643 | 7.803339 | -0.241738 | 5.888977e+05 | 2.361727e+05 |
| std | 7.211755 | 1.269315e+08 | 3.272166 | 1.079953 | 1853.432481 | 1.439489 | 10.787659 | 8.248048 | 24.761457 | 3.086721 | 31.312670 | 2.128713e+06 | 9.576988e+05 |
| min | 2000.000000 | 3.951000e+03 | -115.360000 | -1.400000 | 0.000000 | 0.680000 | 4.300000 | 29.600000 | 1.500000 | 1.200000 | -1152.600000 | 3.300000e+01 | 3.200000e+01 |
| 25% | 2006.000000 | 4.641450e+05 | 0.360000 | 0.410000 | 36.800000 | 1.690000 | 11.500000 | 67.700000 | 6.800000 | 5.700000 | -3.400000 | 6.954000e+03 | 2.354000e+03 |
| 50% | 2012.000000 | 5.238011e+06 | 1.080000 | 1.150000 | 89.300000 | 2.160000 | 17.400000 | 73.800000 | 15.400000 | 7.300000 | -0.300000 | 7.808400e+04 | 3.870100e+04 |
| 75% | 2018.000000 | 1.953645e+07 | 2.080000 | 2.050000 | 218.400000 | 3.380000 | 27.100000 | 77.800000 | 36.400000 | 9.400000 | 2.200000 | 4.663710e+05 | 1.551320e+05 |
| max | 2024.000000 | 1.412197e+09 | 121.110000 | 4.030000 | 23015.200000 | 8.300000 | 57.200000 | 89.900000 | 197.800000 | 33.400000 | 1211.800000 | 2.726854e+07 | 1.534324e+07 |
The descriptive analysis quickly calculates some metrics for us, but it's in the visualizations that you truly begin to understand your data. Let's create three plots that visualize the Population Per Year, Population Density by Year, and Net Migration by Year as an aggregate of all countries.
In [25]:
# Importing FuncFormatter to control the y-axis tick labels
from matplotlib.ticker import FuncFormatter
# Custom function to format y-axis as plain numbers
# Default is scientific notation
def format_number(x, _):
return f"{int(x):,}"
# Sum of population totals per year
# Here we are using all countries, which is why it's important to save the unaltered data.
# (note) Our raw population data had no missing values
pop_by_year = df_all_countries.groupby('Year')['Population'].sum().reset_index()
# Creating the plot of Population by Year
plt.figure(figsize=(12,5))
sns.lineplot(data=pop_by_year,x='Year',y="Population")
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
plt.title('Population by Year')
plt.xlabel('Year')
plt.ylabel('Population')
plt.xlim(1999, 2025)
plt.show()
In [26]:
# Creating an aggregated population density metric by mean
pop_den_by_year = df.groupby('Year')['Population Density'].mean().reset_index()
# Creating the Mean Population Density Plot
plt.figure(figsize=(12, 5))
sns.lineplot(data=pop_den_by_year, x='Year', y='Population Density')
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
plt.suptitle('Mean Population Density by Year')
plt.title("People per sq. km.")
plt.xlabel('Year')
plt.ylabel('Population Density (Mean)')
plt.xlim(1999, 2025)
plt.show()
In [27]:
# Creating an aggregated net migration metric by mean
pop_by_year = df.groupby('Year')['Net Migration Rate'].mean().reset_index()
# Creating the net migration Plot
plt.figure(figsize=(12, 5))
sns.lineplot(data=pop_by_year, x='Year', y='Net Migration Rate')
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
plt.title('Average Net Migration Rate by Year')
plt.xlabel('Year')
plt.ylabel('Migration Rate (Mean)')
plt.xlim(1999, 2025)
plt.show()
These three plots are simple univariate visualizations displayed over time by year, but they unlock some underlying data about our world. Let’s give a brief description of what we found:
- World Population per Year: There has been an increase of almost 2 billion people since 2000! The population now sits at ~8 billion people world-wide.
- Mean Population Density per Year: The population density has risen by ~100 people per square kilometer on average world-wide.
- Average Net Migration Rate: From 2000 to 2010, the average net migration rate was erratic, with extreme movement in both directions, but after 2010 it began to stabilize.
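The first finding can be sanity-checked with a quick aggregate. The numbers below are illustrative stand-ins; the real figures come from the yearly population groupby built earlier:

```python
import pandas as pd

# Illustrative stand-in for the yearly world-population aggregate
# (the real values come from df_all_countries.groupby('Year')['Population'].sum())
pop_by_year = pd.DataFrame({'Year': [2000, 2024],
                            'Population': [6_100_000_000, 8_000_000_000]})

# Growth over the window: last year's total minus the first year's total
growth = pop_by_year['Population'].iloc[-1] - pop_by_year['Population'].iloc[0]

print(f'{growth:,}')  # 1,900,000,000
```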
The erratic net migration rate from 2000 to 2010 can partly be understood by looking at the next table, which shows large swings of net migration in both directions for Montserrat and one large inflow for the United Arab Emirates. Montserrat's swings can be explained by the island's active volcano, which has caused several waves of migration out of the country during evacuations and back in during resettlement. More about the Soufrière Hills Volcano can be read on the Global Volcanism Program website, sponsored by the Smithsonian Institution, at the following link:
In [28]:
# High net migration in the data
high_net_migration = df[abs(df['Net Migration Rate']) > 250]
high_net_migration
Out[28]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 138 | Montserrat | MS | 2000.0 | 3951 | -115.36 | -0.10 | 38.7 | 1.34 | 12.2 | 72.6 | 16.5 | 13.2 | -1152.6 | 48.0 | 52.0 |
| 366 | Montserrat | MS | 2001.0 | 4239 | 121.11 | -0.07 | 41.6 | 1.20 | 11.1 | 72.1 | 17.5 | 11.8 | 1211.8 | 47.0 | 50.0 |
| 594 | Montserrat | MS | 2002.0 | 4810 | -82.97 | 0.21 | 47.2 | 1.18 | 11.2 | 73.0 | 15.8 | 9.2 | -831.8 | 54.0 | 44.0 |
| 822 | Montserrat | MS | 2003.0 | 4491 | 74.68 | -0.33 | 44.0 | 0.94 | 8.9 | 71.5 | 18.7 | 12.3 | 750.2 | 40.0 | 55.0 |
| 1050 | Montserrat | MS | 2004.0 | 4963 | -48.58 | -0.18 | 48.7 | 0.98 | 9.5 | 73.1 | 15.6 | 11.3 | -484.0 | 47.0 | 56.0 |
| 1278 | Montserrat | MS | 2005.0 | 4530 | 34.08 | 0.09 | 44.4 | 1.45 | 13.9 | 71.1 | 19.7 | 13.0 | 340.0 | 63.0 | 59.0 |
| 1506 | Montserrat | MS | 2006.0 | 4625 | -29.25 | 0.04 | 45.3 | 1.08 | 10.6 | 72.4 | 16.9 | 10.2 | -293.0 | 49.0 | 47.0 |
| 2037 | United Arab Emirates | AE | 2008.0 | 7159278 | 28.50 | 1.04 | 85.6 | 1.81 | 11.7 | 77.2 | 7.8 | 1.2 | 274.5 | 83427.0 | 8793.0 |
Another major aspect of EDA is exploring the correlation between features. Understanding feature correlation, particularly multicollinearity, is a crucial part of data discovery and should be examined regardless of your dataset. Multicollinearity can significantly impact data analysis and predictive modeling, as we will explore shortly. For now, let’s create a correlation matrix to visualize the relationships between our independent features.
In [29]:
# Importing textwrap to wrap the long axis labels
import textwrap
# Creating the correlation matrix
corr_matrix = df.iloc[:, 2:].corr()
# Creating a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
# Function to wrap long labels
def wrap_labels(labels, width):
    return ['\n'.join(textwrap.wrap(label, width)) for label in labels]
# Wrapping the x and y axis labels to condense the visualization
wrapped_columns = wrap_labels(corr_matrix.columns, width=20)
# Setting the figure size of the heatmap
plt.figure(figsize=(10, 6))
# Creating the heatmap
sns.heatmap(
corr_matrix, mask=mask, annot=True, cmap='coolwarm', fmt=".2f", square=False,
xticklabels=wrapped_columns, yticklabels=wrapped_columns, cbar=False)
# Setting the plot title
plt.title('Correlation Matrix (Lower Triangle)')
# Changing the tick mark label sizes
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)
# Showing the plot
plt.show()
The correlation matrix shows that we have some strong correlation between several of our independent variables. The strongest correlation present is between Population and Deaths with a 98% correlation. This simply means that as the population of a country increases, so does its number of deaths, which makes perfect sense. The strongest negative correlation present is between Infant Mortality and Life Expectancy with a -92% correlation, meaning the higher the infant mortality rate, the lower the life expectancy of a country.
Let's take a closer look and highlight some of our major correlations to life expectancy.
- Life Expectancy & Year: 24%
- Life Expectancy & Population Density: 24%
- Life Expectancy & Crude Death Rate: -45%
- Life Expectancy & Rate of Natural Increase: -71%
- Life Expectancy & Total Fertility Rate: -80%
- Life Expectancy & Crude Birth Rate: -84%
- Life Expectancy & Infant Mortality Rate: -92%
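These target correlations can also be pulled directly from the correlation matrix in code rather than read off the heatmap. A minimal sketch of the pattern, using a tiny made-up frame (`demo`) in place of our full `df` so it runs standalone:

```python
import pandas as pd

# Tiny illustrative stand-in for the numeric columns of df
demo = pd.DataFrame({
    'Crude Birth Rate': [40.2, 12.2, 23.0, 27.5, 52.2],
    'Infant Mortality Rate, Both Sexes': [60.7, 15.6, 25.8, 78.4, 197.8],
    'Life Expectancy at Birth, Both Sexes': [61.7, 76.1, 75.7, 29.6, 31.4],
})

# Correlation of every numeric feature with the target, sorted ascending
target = 'Life Expectancy at Birth, Both Sexes'
corr_with_target = (
    demo.corr(numeric_only=True)[target]
    .drop(target)          # drop the target's correlation with itself (1.0)
    .sort_values()
)
print(corr_with_target)
```

On the real dataset, replacing `demo` with the numeric slice of `df` would reproduce the list above in one line of output.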
For the remainder of the Exploratory Data Analysis, we’ll focus on creating various visualizations to explore our data. This section will be mostly code-based, with one exception: a bit more data cleaning to correct country names for mapping purposes. Effective visualizations tell the story of the data, and when done well, they require minimal explanation. We’ll resume commentary when we shift to analyzing our world maps using folium.
In [30]:
# Creating an aggregated dataset of mean values by Country
aggregated_by_country = df.groupby('Name', as_index=False).agg({
'GENC': 'first',
'Year': 'mean',
'Population': 'mean',
'Annual Growth Rate': 'mean',
'Rate of Natural Increase': 'mean',
'Population Density': 'mean',
'Total Fertility Rate': 'mean',
'Crude Birth Rate': 'mean',
'Life Expectancy at Birth, Both Sexes': 'mean',
'Infant Mortality Rate, Both Sexes': 'mean',
'Crude Death Rate': 'mean',
'Net Migration Rate': 'mean',
'Births, Both Sexes': 'mean',
'Deaths, Both Sexes': 'mean'
})
# Showing the first three rows
aggregated_by_country.head(3)
Out[30]:
| | Name | GENC | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | 2012.0 | 35682196.00 | 2.8872 | 3.1944 | 54.712 | 6.0984 | 40.204 | 61.660 | 60.688 | 8.252 | -3.068 | 1413325.08 | 284472.72 |
| 1 | Albania | AL | 2012.0 | 2843029.20 | -0.8524 | 0.4932 | 103.764 | 1.5200 | 12.188 | 76.076 | 15.592 | 7.256 | -13.448 | 34938.48 | 20419.80 |
| 2 | Algeria | DZ | 2012.0 | 38002869.36 | 1.7740 | 1.8664 | 15.948 | 2.8040 | 23.008 | 75.660 | 25.760 | 4.352 | -0.944 | 877749.52 | 165181.36 |
In [31]:
# Using adjustText to add separation to plot labels
from adjustText import adjust_text
# Getting the top 40 countries with the highest mean births
country_birth_rates = aggregated_by_country.nlargest(40, 'Births, Both Sexes')
# Setting the figure size
plt.figure(figsize=(12, 7))
# Creating the scatter plot
sns.scatterplot(data=aggregated_by_country, x='Total Fertility Rate', y='Births, Both Sexes')
# Creating a list to hold the top 40 country birth labels
texts = []
# Collecting the country names in the list
for _, row in country_birth_rates.iterrows():
    texts.append(plt.text(x=row['Total Fertility Rate'], y=row['Births, Both Sexes'], s=row['Name'], fontsize=10))
# Writing the text on the plot (Avoids overlapping text)
adjust_text(texts, arrowprops=dict(arrowstyle='-', color='gray', lw=0.5))
# Preventing the y-axis from formatting in scientific notation (previously written function)
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
# Setting the plot title and subtitle
plt.suptitle('Mean Fertility Rate by Mean Births from 2000-2024', fontsize=16)
plt.title("Top 40 countries by birth rate labeled")
# Setting the axis labels
plt.xlabel('# Children Expected per Woman', fontsize=12)
plt.ylabel('Mean Live Births per Year', fontsize=12)
# Showing the plot
plt.show()
In [32]:
# China and India's data
china_data = df[df['Name'] == 'China']
india_data = df[df['Name'] == 'India']
# Plotting country births
plt.figure(figsize=(12, 6))
sns.lineplot(data=china_data, x='Year', y='Births, Both Sexes', label='China', color='blue')
sns.lineplot(data=india_data, x='Year', y='Births, Both Sexes', label='India', color='orange')
# Adding titles and adjusting the plot
plt.suptitle('Births by Year', fontsize=16,x=.5)
plt.title('China and India', fontsize=12,x=.475)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Births (Both Sexes)', fontsize=12)
plt.legend(title='Country', fontsize=10, title_fontsize=12)
plt.grid(True)
plt.tight_layout()
# Using the previously written function to format the y axis
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
# Placing tick marks every year
plt.xticks(ticks=range(2000, 2025), rotation=45)
# displaying the plot
plt.show()
In [33]:
# Getting agg world population
world_population_by_year = df.groupby('Year')['Births, Both Sexes'].sum().reset_index()
# Plotting country data
plt.figure(figsize=(12, 6))
sns.lineplot(data=world_population_by_year, x='Year', y='Births, Both Sexes', label='World Births', color='blue')
# Designing the plot
plt.suptitle('World Births by Year', fontsize=16,x=.5)
plt.xlabel('Year', fontsize=12)
plt.ylabel('# Births', fontsize=12)
plt.grid(True)
plt.tight_layout()
# Using the previously written function to format the y axis
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
# Setting tick marks to be every year
plt.xticks(ticks=range(2000, 2025), rotation=45)
# Show the plot
plt.show()
In [34]:
# Using the "df_all_countries" dataframe to include the USA and other missing countries
# that had filled population data but were missing other features.
# Getting China's mean Population from 2000-2024
pop_china = df_all_countries[df_all_countries['Name'] == 'China'].Population.mean()
# Getting India's mean Population from 2000-2024
pop_india = df_all_countries[df_all_countries['Name'] == 'India'].Population.mean()
# Combining china and india population totals
pop_china_india = pop_china + pop_india
# Getting total mean population from all countries
total_pop = df_all_countries.groupby('Year')['Population'].sum().mean()
# Subtracting China and India's population from the world mean
pop_rest_of_world = total_pop - pop_china_india
# Setting the labels for the Pie Chart
labels = ['China & India Combined', 'Rest of the World']
# Defining the size of the wedges
sizes = [pop_china_india, pop_rest_of_world]
# Setting the colors of the pie plot
colors = ['skyblue', 'lightcoral']
# Setting the figure size of the pie chart
plt.figure(figsize=(5, 5))
# Plotting the pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, startangle=140)
# Setting a title for the plot
plt.suptitle('Mean Population', fontsize=16)
plt.title("2000-2024")
plt.show()
In [35]:
# Getting 2024 values for China & India vs the rest of the world
china_pop_2024 = df_all_countries[(df_all_countries['Name'] == 'China') & (df_all_countries['Year'] == 2024)].reset_index().Population[0]
india_pop_2024 = df_all_countries[(df_all_countries['Name'] == 'India') & (df_all_countries['Year'] == 2024)].reset_index().Population[0]
total_pop_2024 = df_all_countries[df_all_countries['Year']==2024].Population.sum()
pop_2024_china_india = china_pop_2024 + india_pop_2024
pop_2024_remain = total_pop_2024 - pop_2024_china_india
In [36]:
# Importing Image from PIL to handle the background globe image
from PIL import Image
# Importing urllib.request to download the image of the globe
import urllib.request
# Setting the pie chart labels
labels = ['China & India', 'Rest of the World']
# Setting the pie chart dimensions
sizes = [pop_2024_china_india, pop_2024_remain]
# Setting the colors of the pie plot
colors = ['skyblue', 'lightcoral']
# Loading the globe image from the URL using Pillow
url = 'https://www.freepnglogos.com/uploads/globe-png/file-blank-globe-svg-wikimedia-commons-40.png'
with urllib.request.urlopen(url) as f:
    globe_image = Image.open(f)
globe_image = np.array(globe_image)
# Setting the size of the visualization
fig, ax = plt.subplots(figsize=(12, 5))
# Setting the background color to black
fig.patch.set_facecolor('black')
# Adding the globe to be the background of the pie chart
ax.imshow(globe_image, extent=[-1, 1, -1, 1], alpha=1, zorder=0)
# Creating the pie chart, setting the alpha level for transparency, and changing the font color to white
wedges, texts, autotexts = ax.pie(
sizes,
labels=labels,
autopct='%1.1f%%',
colors=colors,
startangle=300,
textprops={'fontsize': 18, 'color': 'white'},
wedgeprops={'zorder': 1, 'alpha': 0.7})
# Formatting the total population with commas
formatted_total_population = f"{total_pop_2024:,.0f}"
# Adding a title and sub title to the plot
plt.suptitle('World Population: 2024', fontsize=20, color='white')
plt.title(f'Total Population: {formatted_total_population}', color='white')
# Removing the axis for a cleaner look
ax.axis('equal')
ax.axis('off')
# Showing the plot with a tight layout
plt.tight_layout()
plt.show()
In [37]:
# Importing adjustText to help with plot labels
from adjustText import adjust_text
# Getting the top 40 countries with the highest crude birth rates
country_birth_rates = aggregated_by_country.nlargest(40, 'Crude Birth Rate')
# Setting the figure size
plt.figure(figsize=(12, 7))
# Creating the scatter plot with a fitted regression line
sns.regplot(data=aggregated_by_country, x='Total Fertility Rate', y='Crude Birth Rate')
# Creating a list to hold the top 40 country labels
texts = []
# Collecting the country names in the list
for _, row in country_birth_rates.iterrows():
    texts.append(plt.text(x=row['Total Fertility Rate'], y=row['Crude Birth Rate'], s=row['Name'], fontsize=10))
# Writing the text on the plot (Avoids overlapping text)
adjust_text(texts, arrowprops=dict(arrowstyle='-', color='gray', lw=0.5))
# Preventing the y-axis from formatting in scientific notation (previously written function)
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
# Setting the plot title and subtitle
plt.suptitle('Mean Fertility Rate by Crude Birth Rate from 2000-2024', fontsize=16)
plt.title("Top 40 countries by crude birth rate labeled")
# Setting the axis labels
plt.xlabel('# Children Expected per Woman', fontsize=12)
plt.ylabel('Crude Birth Rate (per 1,000)', fontsize=12)
# Showing the plot
plt.show()
For the data visualizations using world maps, we are going to utilize the Folium library. The Folium library is an extensive graphing library that handles interactive maps extremely well. For more information about Folium maps and getting started, see the following link:
In [38]:
import folium
from folium import Choropleth
from IPython.display import display, IFrame
# Creating a folium map
world_map = folium.Map(location=[0, 0], zoom_start=2)
# Creating the choropleth layer
Choropleth(
geo_data='https://raw.githubusercontent.com/johan/world.geo.json/master/countries.geo.json',
data=aggregated_by_country,
columns=['Name', 'Crude Death Rate'],
key_on='feature.properties.name',
fill_color='Reds',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Mean Crude Death Rate'
).add_to(world_map)
# Saving the world map
world_map.save("world_map_death_1.html")
Map of the Mean Crude Death Rate
In [39]:
# Change the src to the github repository location
#<iframe src="maps/world_map_death_1.html" width="65%" height="600"></iframe>
# Displaying the map using IFrame
display(IFrame("world_map_death_1.html", width="65%", height="600"))
If we examine the map we just created, we notice some countries, like South Korea, appear to have no data despite data being available. This issue arises due to naming inconsistencies between our dataset’s country names and those in the geo.json file used to populate the Folium map. The next two code chunks will identify these discrepancies and resolve the mismatched country names to ensure accurate mapping.
In [40]:
import requests
# Loading the geojson file
geojson_url = 'https://raw.githubusercontent.com/johan/world.geo.json/master/countries.geo.json'
geojson_data = requests.get(geojson_url).json()
# Getting the geojson country names
geojson_countries = [feature['properties']['name'] for feature in geojson_data['features']]
# Getting our datasets country names
dataset_countries = df['Name'].unique()
# Finding the mismatched countries
non_matching_countries = [country for country in dataset_countries if country not in geojson_countries]
In [41]:
# Correcting all of the country mismatches
name_corrections = {
'American Samoa': 'Samoa',
'Bahamas, The': 'Bahamas',
'Burma': 'Myanmar',
'Cabo Verde': 'Cape Verde',
'Congo (Brazzaville)': 'Republic of the Congo',
'Congo (Kinshasa)': 'Democratic Republic of the Congo',
'Curaçao': 'Curacao',
'Czechia': 'Czech Republic',
"Côte d’Ivoire": "Ivory Coast",
'Eswatini': 'Swaziland',
'Gambia, The': 'Gambia',
'Gaza Strip': 'Palestinian Territory',
'Guinea-Bissau': 'Guinea Bissau',
'Guiana':'French Guiana',
'Korea, North': 'North Korea',
'Korea, South': 'South Korea',
'Macau': 'Macao',
'Micronesia, Federated States of': 'Micronesia',
'North Macedonia': 'Macedonia',
'Saint Barthelemy': 'Saint Barthelemy',
'Saint Helena, Ascension, and Tristan da Cunha': 'Saint Helena',
'Saint Kitts and Nevis': 'Saint Kitts and Nevis',
'Saint Lucia': 'Saint Lucia',
'Saint Martin': 'Saint Martin',
'Saint Pierre and Miquelon': 'Saint Pierre and Miquelon',
'Saint Vincent and the Grenadines': 'Saint Vincent',
'Serbia':'Republic of Serbia',
'Tanzania':'United Republic of Tanzania',
'Timor-Leste': 'Timor Leste',
'Virgin Islands, British': 'British Virgin Islands',
'Virgin Islands, U.S.': 'United States Virgin Islands'}
# Applying the corrections
df['Name'] = df['Name'].replace(name_corrections)
In [42]:
# Recreating the aggregated dataset of mean values by country
aggregated_by_country = df.groupby('Name', as_index=False).agg({
'GENC': 'first',
'Year': 'mean',
'Population': 'mean',
'Annual Growth Rate': 'mean',
'Rate of Natural Increase': 'mean',
'Population Density': 'mean',
'Total Fertility Rate': 'mean',
'Crude Birth Rate': 'mean',
'Life Expectancy at Birth, Both Sexes': 'mean',
'Infant Mortality Rate, Both Sexes': 'mean',
'Crude Death Rate': 'mean',
'Net Migration Rate': 'mean',
'Births, Both Sexes': 'mean',
'Deaths, Both Sexes': 'mean'})
In [43]:
# Creating a folium map
world_map = folium.Map(location=[0, 0], zoom_start=2,tiles="OpenStreetMap")
# Creating the choropleth layer
Choropleth(
geo_data='https://raw.githubusercontent.com/johan/world.geo.json/master/countries.geo.json',
data=aggregated_by_country,
columns=['Name', 'Crude Death Rate'],
key_on='feature.properties.name',
fill_color='Reds',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Mean Crude Death Rate'
).add_to(world_map)
# Saving the world map
world_map.save("world_map_death_2.html")
Map of the Mean Crude Death Rate (Fixed)
In [44]:
# Change the src to the github repository location
#<iframe src="maps/world_map_death_2.html" width="65%" height="600"></iframe>
# Displaying the map using IFrame
display(IFrame("world_map_death_2.html", width="65%", height="600"))
This looks much better! While it’s unfortunate that we don’t have data for every country, we still have a solid representation of the world. There are a few areas on the map displaying oddly, such as Somaliland, a territory in Somalia which is unrecognized globally and appears masked out, but overall, the visualization works well.
Life Expectancy is the main focus of our analysis and the feature we are most interested in. We have explored several of the other features up to this point, which is fundamental to understanding our data, but now it's time to focus on life expectancy.
In [45]:
# Importing adjustText to help with plot labels
from adjustText import adjust_text
# Getting the top 20 countries with the highest mean life expectancy
top_life_expectancy = aggregated_by_country.nlargest(20, 'Life Expectancy at Birth, Both Sexes')
# Setting the figure size
plt.figure(figsize=(12, 7))
# Creating the scatter plot with a fitted regression line
sns.regplot(
    data=aggregated_by_country,
    y='Life Expectancy at Birth, Both Sexes',
    x='Crude Death Rate')
# Creating a list to hold the top 20 country labels
texts = []
# Collecting the country names in the list
for _, row in top_life_expectancy.iterrows():
    texts.append(plt.text(x=row['Crude Death Rate'], y=row['Life Expectancy at Birth, Both Sexes'], s=row['Name'], fontsize=10))
# Writing the text on the plot (Avoids overlapping text)
adjust_text(texts, arrowprops=dict(arrowstyle='-', color='gray', lw=0.5))
# Preventing the y-axis from formatting in scientific notation (previously written function)
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
# Setting the plot title and subtitle
plt.suptitle('Mean Crude Death Rate by Life Expectancy from 2000-2024', fontsize=16)
plt.title("Top 20 countries by mean life expectancy labeled")
# Setting the axis labels
plt.xlabel('Crude Death Rate (per 1,000)', fontsize=12)
plt.ylabel('Mean Life Expectancy', fontsize=12)
# Showing the plot
plt.show()
The plot above displays the mean crude death rate against the mean life expectancy. As highlighted in our earlier correlation matrix, we already knew these two features were correlated, but now we can visualize this relationship. Additionally, we’ve enhanced the graphic’s informational value by highlighting the top 20 countries with the highest mean life expectancy.
In [46]:
# Importing adjustText to help with plot labels
from adjustText import adjust_text
# Getting the top 30 countries with the highest Infant Mortality Rate
top_infant_mortality = aggregated_by_country.nlargest(30, 'Infant Mortality Rate, Both Sexes')
# Setting the figure size
plt.figure(figsize=(12, 7))
# Creating the scatter plot with a fitted regression line
sns.regplot(data=aggregated_by_country, x='Life Expectancy at Birth, Both Sexes', y='Infant Mortality Rate, Both Sexes')
# Creating a list to hold the top 30 country labels
texts = []
# Collecting the country names in the list
for _, row in top_infant_mortality.iterrows():
    texts.append(plt.text(
        y=row['Infant Mortality Rate, Both Sexes'],
        x=row['Life Expectancy at Birth, Both Sexes'],
        s=row['Name'],
        fontsize=10))
# Writing the text on the plot (Avoids overlapping text)
adjust_text(texts, arrowprops=dict(arrowstyle='-', color='gray', lw=0.5))
# Preventing the y-axis from formatting in scientific notation (previously written function)
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_number))
# Setting the plot title and subtitle
plt.suptitle('Mean Life Expectancy by Infant Mortality Rate from 2000-2024', fontsize=16)
plt.title("Top 30 countries by mean Infant Mortality Rate labeled")
# Setting the axis labels
plt.ylabel('Infant Mortality Rate (per 1,000 births)', fontsize=12)
plt.xlabel('Mean Life Expectancy', fontsize=12)
# Showing the plot
plt.show()
The graph above illustrates the relationship between mean life expectancy and infant mortality, which shows the strongest negative correlation in our data at -92%. It is clear that countries with high infant mortality rates tend to have significantly lower life expectancies than those with low infant mortality rates. Another striking pattern is the concentration of countries with high infant mortality rates, predominantly in Africa. When this data is visualized on a map (below), the African continent definitely stands out as having the highest infant mortality rates in the world.
In [47]:
# Creating a folium map
world_map = folium.Map(location=[0, 0], zoom_start=2,tiles="OpenStreetMap")
# Creating the choropleth layer
Choropleth(
geo_data='https://raw.githubusercontent.com/johan/world.geo.json/master/countries.geo.json',
data=aggregated_by_country,
columns=['Name', 'Infant Mortality Rate, Both Sexes'],
key_on='feature.properties.name',
fill_color='Reds',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Infant Mortality Rate, Both Sexes'
).add_to(world_map)
# Saving the world map
world_map.save("world_map_inf_mort.html")
Map of Infant Mortality Rate
In [48]:
# Change the src to the github repository location
#<iframe src="maps/world_map_inf_mort.html" width="65%" height="600"></iframe>
# Displaying the map using IFrame
display(IFrame("world_map_inf_mort.html", width="65%", height="600"))
Now let’s examine life expectancy across all countries for each year in a single visualization. This includes data from 220 countries for a total of 5,525 life expectancy values. Visualizing a lot of data points can often result in cluttered or overwhelming graphs. However, since we’re focusing on life expectancy itself and not the specific countries (as shown above), a violin plot is an ideal choice. The violin plot consolidates all data points into a streamlined graph, effectively showing the variance or skew of life expectancy per year. This makes it perfect for identifying underlying trends in the data without overlooking any individual data points.
In [49]:
# Setting the figure size
plt.figure(figsize=(12, 6))
# Creating the violin plot
sns.violinplot(data=df, x=df['Year'].astype(int), y='Life Expectancy at Birth, Both Sexes')
# Adding titles and labels to the plot
plt.title('Distribution of Life Expectancy by Year', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Life Expectancy (Both Sexes)', fontsize=12)
# Rotating the labels
plt.xticks(rotation=45)
# Displaying the plot
plt.tight_layout()
plt.show()
print(f"2000's Mean Life Expectancy:{round((df['Life Expectancy at Birth, Both Sexes'][df['Year']==2000].mean()),2)}")
print(f"2024's Mean Life Expectancy:{round((df['Life Expectancy at Birth, Both Sexes'][df['Year']==2024].mean()),2)}")
2000's Mean Life Expectancy:68.39
2024's Mean Life Expectancy:75.1
The white lines on the violin plot represent the median life expectancy for each year, showing a subtle upward trend over time. The vertical range of the violins illustrates the distribution and skew of life expectancy in each year, covering all data points. Overall, the data appears consistent, with the obvious exceptions of 2010 and 2011. Taking a closer look at these outliers, below, we can see that Haiti and Somalia had very low life expectancies during 2010 and 2011 respectively. In 2010 Haiti faced a devastating 7.0 earthquake that was catastrophic for the country, and Somalia experienced serious famine and civil war in 2011. More about these events can be read at the following two links:
Britannica, written by Richard Pallardy
International Rescue Committee
In [50]:
# Haiti had a massive 7.0 earthquake in 2010 that led to 220,000 - 300,000 deaths and created ongoing issues
# Somalia experienced a great famine and civil war during 2011
print(df[df['Year'] == 2010].loc[df[df['Year'] == 2010]['Life Expectancy at Birth, Both Sexes'].idxmin()])
print()
print(df[df['Year'] == 2011].loc[df[df['Year'] == 2011]['Life Expectancy at Birth, Both Sexes'].idxmin()])
Name                                          Haiti
GENC                                             HT
Year                                         2010.0
Population                                  9569827
Annual Growth Rate                            -1.03
Rate of Natural Increase                      -0.54
Population Density                            347.2
Total Fertility Rate                            3.5
Crude Birth Rate                               27.5
Life Expectancy at Birth, Both Sexes           29.6
Infant Mortality Rate, Both Sexes              78.4
Crude Death Rate                               32.9
Net Migration Rate                             -4.8
Births, Both Sexes                         263209.0
Deaths, Both Sexes                         315219.0
Name: 2368, dtype: object

Name                                        Somalia
GENC                                             SO
Year                                         2011.0
Population                                 12531295
Annual Growth Rate                            -0.75
Rate of Natural Increase                       1.88
Population Density                             20.0
Total Fertility Rate                           7.13
Crude Birth Rate                               52.2
Life Expectancy at Birth, Both Sexes           31.4
Infant Mortality Rate, Both Sexes             197.8
Crude Death Rate                               33.4
Net Migration Rate                            -26.3
Births, Both Sexes                         653797.0
Deaths, Both Sexes                         418497.0
Name: 2696, dtype: object
In [51]:
# Creating a folium map
world_map = folium.Map(location=[0, 0], zoom_start=2,tiles="OpenStreetMap")
# Creating the choropleth layer
Choropleth(
geo_data='https://raw.githubusercontent.com/johan/world.geo.json/master/countries.geo.json', # GeoJSON data
data=aggregated_by_country,
columns=['Name', 'Life Expectancy at Birth, Both Sexes'], # Match on country names
key_on='feature.properties.name', # Use 'name' field from GeoJSON
fill_color='RdYlGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Life Expectancy at Birth, Both Sexes'
).add_to(world_map)
# Saving the world map
world_map.save("world_map_life_expectancy.html")
Map of Life Expectancy at Birth
In [52]:
# Change the src to the github repository location
#<iframe src="maps/world_map_inf_mort.html" width="65%" height="600"></iframe>
# Displaying the map using IFrame
display(IFrame("world_map_life_expectancy.html", width="65%", height="600"))
The life expectancy at birth map is pretty straightforward, but we can clearly see that the African continent has the lowest average life expectancy at birth. This is to be expected, since the correlation between life expectancy and infant mortality was strongly negative, and the African continent stood out as having high infant mortality rates.
That covers our Exploratory Data Analysis for this project. Next, we move into the following stage of the pipeline: hypothesis testing and machine learning models. If you would like more information on EDA, I have provided some resources below to take your knowledge of the topic further.
Simpli Learn, Written by Avijeet Biswal
Geeks for Geeks
Analytics Vidhya, Written by malamahadevan
In this section of the tutorial, we will build statistical models to further uncover the underlying trends that influence human life expectancy. Using the best-performing models, we will also forecast life expectancy into the future.
In a real-world scenario, you won't always know with certainty which model will perform best. You can make an educated guess, but you won't always be correct. It's best to experiment with different models and hyperparameters and let the statistics decide. For this tutorial, we'll start with a basic multivariate linear regression model and progressively explore more complex models, testing different variations of the input data along the way. Let's get started!
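One practical way to run that experiment is k-fold cross-validation over a dictionary of candidate models. A minimal sketch with scikit-learn; the data here is a synthetic stand-in, not our census dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 samples, 4 features, a mostly linear target with noise
rng = np.random.default_rng(37)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Score each candidate with 5-fold cross-validation on R^2 and keep the mean
candidates = {
    'linear': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=37),
}
scores = {name: cross_val_score(model, X, y, cv=5, scoring='r2').mean()
          for name, model in candidates.items()}
print(scores)
```

On genuinely linear data the plain linear model should win; on our real dataset the ranking is an empirical question, which is exactly why we test rather than guess.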
In [53]:
# Multivariate linear regression for exploratory model building
# Library and function imports
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
# Removing the categorical data and the target variable from our input matrix
X = df.drop(["Life Expectancy at Birth, Both Sexes", 'GENC', 'Name'], axis=1)
# Adding a bias column of ones to capture the intercept's weight
X = sm.add_constant(X)
# Setting our target feature variable
y = df["Life Expectancy at Birth, Both Sexes"]
# Splitting 2/3 of the data into training data and 1/3 testing data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=37)
We can examine the distribution of our target variable for both the training and testing sets. While a Gaussian target isn't a strict requirement for a linear regression model, it's always useful to check. In our case, the distribution is approximately Gaussian but left-skewed, with the tail extending toward lower life expectancies. This mild skew isn't severe enough to be a concern, and we can move forward with linear regression.
In [54]:
# Create a 2x2 grid of subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Creating a histplot of the training target variable
sns.histplot(y_train, bins=50, kde=True, color="blue", edgecolor="black", ax=axes[0])
axes[0].set_xlabel("Life Expectancy", fontsize=14)
axes[0].set_ylabel("Frequency", fontsize=14)
axes[0].set_title("Training Set", fontsize=16)
# Creating a histplot of the testing target variable
sns.histplot(y_test, bins=50, kde=True, color="blue", edgecolor="black", ax=axes[1])
axes[1].set_xlabel("Life Expectancy", fontsize=14)
axes[1].set_ylabel("Frequency", fontsize=14)
axes[1].set_title("Testing Set", fontsize=16)
# Creating a suptitle
fig.suptitle("Distribution of Life Expectancy", fontsize=18, y=1.02)
# Displaying the plot
plt.tight_layout()
plt.show()
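The left skew the histograms hint at can also be quantified: sample skewness is negative for a left-tailed distribution. A small sketch on synthetic left-skewed data (not our actual target column):

```python
import numpy as np
from scipy.stats import skew

# Synthetic left-skewed sample: reflect an exponential so the long tail
# points toward low values, loosely mimicking a life-expectancy shape
rng = np.random.default_rng(0)
sample = 85 - rng.exponential(scale=8, size=5000)

# Negative skewness => left (low-end) tail
print(skew(sample))
```

Running `skew(y_train)` on the real target would give the corresponding number for our data.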
We can fit a very basic linear model and check the results by using model.summary().
This basic linear model, as seen below, performs extremely well with an R-squared score of 0.916. This means that our basic linear model is explaining 91.6% of the variability in our target variable, life expectancy. That is a very good result, but we must remember that this is just the training data and we still must check this model with our unseen testing data.
We can see our model's final learned weights in the coef column, and the model estimates that for every one-year increase in Year, life expectancy increases by 0.0538 years. Moreover, "Year", "Population", "Population Density", "Total Fertility Rate", "Infant Mortality Rate, Both Sexes", and "Deaths, Both Sexes" all have p-values less than 0.05, which means they are statistically significant in explaining the variability in the target variable. However, as note 2 at the bottom of the model summary states, we have strong multicollinearity between the features that could be affecting these results. This multicollinearity stems from the correlation between our independent variables, which we haven't yet experimented with removing.
In [55]:
# Fitting the model
basic_model = sm.OLS(y_train, X_train).fit()
In [56]:
# Retrieving the models results
print(basic_model.summary())
OLS Regression Results
================================================================================================
Dep. Variable: Life Expectancy at Birth, Both Sexes R-squared: 0.916
Model: OLS Adj. R-squared: 0.916
Method: Least Squares F-statistic: 3370.
Date: Tue, 10 Dec 2024 Prob (F-statistic): 0.00
Time: 20:01:51 Log-Likelihood: -8463.9
No. Observations: 3701 AIC: 1.695e+04
Df Residuals: 3688 BIC: 1.703e+04
Df Model: 12
Covariance Type: nonrobust
=====================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
const -21.4493 11.281 -1.901 0.057 -43.568 0.669
Year 0.0538 0.006 9.618 0.000 0.043 0.065
Population 5.61e-09 2.12e-09 2.649 0.008 1.46e-09 9.76e-09
Annual Growth Rate -6.6250 8.373 -0.791 0.429 -23.041 9.791
Rate of Natural Increase 3.8891 9.654 0.403 0.687 -15.039 22.817
Population Density 0.0003 2.17e-05 16.058 0.000 0.000 0.000
Total Fertility Rate 5.2422 0.148 35.356 0.000 4.952 5.533
Crude Birth Rate -0.7108 0.841 -0.845 0.398 -2.360 0.939
Infant Mortality Rate, Both Sexes -0.1503 0.004 -38.304 0.000 -0.158 -0.143
Crude Death Rate -0.9503 0.842 -1.129 0.259 -2.601 0.700
Net Migration Rate 0.6640 0.837 0.793 0.428 -0.977 2.306
Births, Both Sexes 6.39e-08 5.81e-08 1.099 0.272 -5e-08 1.78e-07
Deaths, Both Sexes -1e-06 2.69e-07 -3.713 0.000 -1.53e-06 -4.72e-07
==============================================================================
Omnibus: 93.825 Durbin-Watson: 1.959
Prob(Omnibus): 0.000 Jarque-Bera (JB): 124.304
Skew: -0.299 Prob(JB): 1.02e-27
Kurtosis: 3.670 Cond. No. 3.81e+10
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.81e+10. This might indicate that there are
strong multicollinearity or other numerical problems.
Let’s now check the assumptions of our linear model. Linear regression models rely on four key assumptions. For this tutorial, I’ve used the assumption definitions provided by Zach Bobbitt on the website Statology. Below, you’ll find the four linear regression assumptions, along with a link to Zach’s article, which provides a more in-depth explanation.
- Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
- Normality: The residuals of the model are normally distributed.
- Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.
- Homoscedasticity: The residuals have constant variance at every level of x.
Statology, Written by Zach Bobbitt
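The visual checks that follow are the focus of this tutorial, but the same assumptions can also be probed numerically. Below is a minimal sketch on synthetic residuals (not our model's), using the Shapiro-Wilk test for normality and the Durbin-Watson statistic for independence; the 0.05 cutoff and the "near 2" rule are conventional guidelines, not hard laws.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Synthetic residuals standing in for a fitted model's .resid attribute
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0, scale=1, size=500)

# Normality: Shapiro-Wilk (a p-value above 0.05 means we fail to
# reject the null hypothesis that the residuals are normal)
stat, p_value = stats.shapiro(residuals)

# Independence: Durbin-Watson (values near 2 suggest no first-order
# autocorrelation; values toward 0 or 4 suggest positive/negative)
dw = durbin_watson(residuals)

print(f"Shapiro-Wilk p-value: {p_value:.3f}")
print(f"Durbin-Watson statistic: {dw:.3f}")
```

On a real model you would pass something like `basic_model.resid` in place of the synthetic array.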
In [57]:
# Getting the models residuals
residuals = basic_model.resid
# setting the figure size
plt.figure(figsize=(12, 6))
# Creating a histplot
sns.histplot(residuals, bins=50, kde=True, color="blue", edgecolor="black")
# Setting title and labels
plt.title("Histogram of Model Residuals", fontsize=16)
plt.xlabel("Residual", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
# Displaying the plot
plt.show()
The histogram above shows the distribution of our residuals in a well-defined Gaussian curve centered around 0. Although it may not be perfect, this bell-shaped distribution is exactly what we are looking for to satisfy the assumptions of normally distributed residuals.
The next code chunk and the resulting graphs will check all four assumptions of a linear model, replicating the output typically seen in the R programming language when plotting a model. This quad-graph output provides an efficient and comprehensive way to assess the assumptions of a linear model.
In [58]:
# Creating the subplot grid
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs Fitted Plot
resid = basic_model.resid
fitted = basic_model.fittedvalues
lowess = sm.nonparametric.lowess(resid, fitted, frac=0.3)
axes[0, 0].scatter(fitted, resid, alpha=0.7)
axes[0, 0].plot(lowess[:, 0], lowess[:, 1], color='red', linestyle='--', linewidth=1.5)
axes[0, 0].axhline(0, color='black', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# Q-Q Plot
sm.qqplot(basic_model.resid, line='45', fit=True, ax=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
# Scale-Location Plot (Standardized Residuals vs Fitted Values)
standardized_residuals = basic_model.resid_pearson
axes[1, 0].scatter(basic_model.fittedvalues, standardized_residuals, alpha=0.7)
axes[1, 0].axhline(0, color='red', linestyle='--')
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('Standardized Residuals')
axes[1, 0].set_title('Scale-Location Plot')
# Leverage vs Residuals (Cook's Distance)
sm.graphics.influence_plot(basic_model, criterion="cooks", ax=axes[1, 1], size=5)
axes[1, 1].set_title("Leverage vs Residuals")
axes[1, 1].set_xlabel('Fitted Values', size = 11)
axes[1, 1].set_ylabel('Standardized Residuals', size = 11)
# Adding Cook's distance lines
n = basic_model.nobs
p = basic_model.df_model + 1
x = np.linspace(0.001, max(basic_model.get_influence().hat_matrix_diag) * 1.1, 100)
for d in [0.5, 1]:
    y = np.sqrt(d * (p * (1 - x) / x))
    axes[1, 1].plot(x, y, color='red', linestyle='--', lw=1)
    axes[1, 1].plot(x, -y, color='red', linestyle='--', lw=1)
# Setting the plot range
# NOTE: I am manually adjusting these limits. Take caution when doing this yourself.
# Start with no limits and then zoom in until you find an appropriate fit.
axes[1, 1].set_xlim(0, 0.5)
axes[1, 1].set_ylim(-10, 10)
# displaying the plot
plt.tight_layout()
plt.show()
Let's go over the assumptions of our linear model as shown in the graph above, starting with the top left and ending with the bottom right.
- Residuals vs Fitted: This plot checks two assumptions: linearity and independence. It helps determine whether there is a linear relationship between the response variable and the independent features (predictors) and whether the residuals are independent of each other. Ideally, we expect no visible patterns in this plot, with residuals equally spread around the black dashed line. Additionally, the red dashed line should be as straight as the black dashed line, indicating the absence of obvious patterns or relationships. In this case, it appears that the basic model fails to meet these assumptions of the linear regression model.
- Q-Q Plot: This plot shows us whether the residuals of the model are normally distributed. We expect them to closely follow the red line, which indicates perfectly normal residuals. This is the same check we did above with the histplot when we looked for a Gaussian distribution. This passes the assumption.
- Scale-Location Plot: The scale-location plot checks for homoscedasticity, the equal variance of the residuals. It looks very similar to the Residuals vs Fitted plot, but here we are checking for an equal spread of points around zero. This plot is currently showing signs of clumping, but it appears to be primarily due to one outlier on the far left of the plot. I wouldn't call this assumption an instant fail, but it could be better.
- Leverage vs Residuals: This plot is used to analyze Cook's distance, a metric that identifies outliers strongly affecting the model. If any points fall outside of the red lines, then they are greatly affecting the model's results. The point labeled 138 (df row 138) is close to the line, but is still within it. Everything passes here.
Check out this article by Bommae Kim on the University of Virginia Library website for more details about these assumption plots.
Although the model failed some of the assumptions of linear regression, let's still check it against the test data.
In [59]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# setting the figure size
plt.figure(figsize=(12, 6))
# Making new predictions from our feature test matrix
y_pred = basic_model.predict(X_test)
# Calculating the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Calculating R^2 (Goodness of fit)
r2 = r2_score(y_test, y_pred)
# Plotting the results
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.suptitle('Actual vs Predicted')
plt.title(f'RMSE = {rmse:.3f} || R² = {r2:.3f}')
plt.show()
The test data actually performed very well, with an R^2 of 91.5% of the variability explained, even with a failed assumption. It appears that even though the residuals were displaying signs of non-independence, the model still performs well on unseen data. The failed assumption would be enough for me to seek out alternative models, however.
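As a side note, the RMSE and R² reported in the plot title are simple to compute by hand. This small sketch (toy numbers, not our data) shows the formulas that `mean_squared_error` and `r2_score` implement:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual vs predicted life expectancies (hypothetical values)
y_true = np.array([70.0, 72.5, 68.0, 75.0, 71.0])
y_hat = np.array([69.5, 73.0, 68.5, 74.0, 71.5])

# RMSE: the root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 - SS_res / SS_tot, the fraction of variance explained
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE = {rmse_manual:.3f} || R² = {r2_manual:.3f}")
```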
The next model we are going to build will test interactions between the predictor variables. This will enable us to capture relationships between the features that otherwise may not be captured.
The Polynomial Regression Model captures higher-order terms by creating interactions between variables. You can control the complexity of the model by tuning the “degree” parameter in PolynomialFeatures(). Increasing the degree adds more complexity and allows the model to fit more intricate patterns in the data. The workflow for this model will follow the same process as the linear regression example above, so review the code to learn more about how polynomial regression works.
In [60]:
from sklearn.preprocessing import PolynomialFeatures
# Fitting Polynomial Features (degree=2, interactions only; bias already included in X)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
# Fitting the interactions
X_interaction = poly.fit_transform(X_train)
# Getting the feature names for the interaction terms
interaction_feature_names = poly.get_feature_names_out(X_train.columns)
# Converting to a dataframe
X_interaction_df = pd.DataFrame(X_interaction, columns=interaction_feature_names, index=X_train.index)
# Fitting the linear regression model with interaction terms
poly_model = sm.OLS(y_train, X_interaction_df).fit()
# Model summary
print(poly_model.summary())
OLS Regression Results
================================================================================================
Dep. Variable: Life Expectancy at Birth, Both Sexes R-squared: 0.951
Model: OLS Adj. R-squared: 0.950
Method: Least Squares F-statistic: 1434.
Date: Tue, 10 Dec 2024 Prob (F-statistic): 0.00
Time: 20:02:13 Log-Likelihood: -7491.1
No. Observations: 3701 AIC: 1.508e+04
Df Residuals: 3651 BIC: 1.539e+04
Df Model: 49
Covariance Type: nonrobust
==============================================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------------------------------
const -3.016e-05 8.57e-08 -352.004 0.000 -3.03e-05 -3e-05
Year 0.0222 8.9e-05 249.149 0.000 0.022 0.022
Population 3.425e-06 5.17e-07 6.627 0.000 2.41e-06 4.44e-06
Annual Growth Rate 1.732e-06 2.11e-06 0.822 0.411 -2.4e-06 5.86e-06
Rate of Natural Increase 1.827e-06 2.27e-06 0.806 0.420 -2.62e-06 6.27e-06
Population Density -0.0019 0.003 -0.730 0.465 -0.007 0.003
Total Fertility Rate 4.847e-07 2.83e-07 1.714 0.087 -6.97e-08 1.04e-06
Crude Birth Rate 1.278e-05 2.57e-06 4.983 0.000 7.75e-06 1.78e-05
Infant Mortality Rate, Both Sexes 2.119e-05 6.29e-06 3.366 0.001 8.85e-06 3.35e-05
Crude Death Rate -3.253e-06 6.72e-07 -4.843 0.000 -4.57e-06 -1.94e-06
Net Migration Rate -4.994e-06 1.38e-06 -3.618 0.000 -7.7e-06 -2.29e-06
Births, Both Sexes -4.827e-05 1.61e-05 -2.998 0.003 -7.98e-05 -1.67e-05
Deaths, Both Sexes 0.0003 5.54e-05 5.787 0.000 0.000 0.000
const Year 0.0222 8.9e-05 249.188 0.000 0.022 0.022
const Population -5.693e-06 6.47e-07 -8.796 0.000 -6.96e-06 -4.42e-06
const Annual Growth Rate 1.722e-06 2.11e-06 0.818 0.414 -2.41e-06 5.85e-06
const Rate of Natural Increase 1.823e-06 2.27e-06 0.804 0.421 -2.62e-06 6.27e-06
const Population Density -0.0019 0.003 -0.730 0.465 -0.007 0.003
const Total Fertility Rate 4.847e-07 2.83e-07 1.714 0.087 -6.97e-08 1.04e-06
const Crude Birth Rate 1.278e-05 2.57e-06 4.983 0.000 7.75e-06 1.78e-05
const Infant Mortality Rate, Both Sexes 2.119e-05 6.29e-06 3.366 0.001 8.85e-06 3.35e-05
const Crude Death Rate -3.253e-06 6.72e-07 -4.843 0.000 -4.57e-06 -1.94e-06
const Net Migration Rate -4.994e-06 1.38e-06 -3.618 0.000 -7.7e-06 -2.29e-06
const Births, Both Sexes -4.827e-05 1.61e-05 -2.998 0.003 -7.98e-05 -1.67e-05
const Deaths, Both Sexes 0.0003 5.54e-05 5.787 0.000 0.000 0.000
Year Population 1.088e-09 3.8e-10 2.862 0.004 3.43e-10 1.83e-09
Year Annual Growth Rate 0.0012 0.004 0.286 0.775 -0.007 0.009
Year Rate of Natural Increase 0.0004 0.005 0.081 0.936 -0.009 0.010
Year Population Density 1.898e-06 2.58e-06 0.735 0.462 -3.16e-06 6.96e-06
Year Total Fertility Rate 0.0019 0.000 16.722 0.000 0.002 0.002
Year Crude Birth Rate -0.0005 0.000 -1.279 0.201 -0.001 0.000
Year Infant Mortality Rate, Both Sexes -0.0002 5.17e-06 -40.959 0.000 -0.000 -0.000
Year Crude Death Rate -5.518e-06 0.000 -0.014 0.989 -0.001 0.001
Year Net Migration Rate -0.0002 0.000 -0.391 0.696 -0.001 0.001
Year Births, Both Sexes 4.868e-08 1.17e-08 4.163 0.000 2.58e-08 7.16e-08
Year Deaths, Both Sexes -3.175e-07 5.23e-08 -6.071 0.000 -4.2e-07 -2.15e-07
Population Annual Growth Rate -3.038e-07 3.88e-07 -0.783 0.434 -1.06e-06 4.57e-07
Population Rate of Natural Increase 6.341e-07 4.98e-07 1.273 0.203 -3.42e-07 1.61e-06
Population Population Density -5.146e-11 2.06e-11 -2.497 0.013 -9.19e-11 -1.1e-11
Population Total Fertility Rate -8.485e-08 1.78e-08 -4.758 0.000 -1.2e-07 -4.99e-08
Population Crude Birth Rate -1.944e-08 4.02e-08 -0.483 0.629 -9.83e-08 5.94e-08
Population Infant Mortality Rate, Both Sexes -5.319e-11 4.05e-10 -0.131 0.896 -8.47e-10 7.41e-10
Population Crude Death Rate 3.989e-08 4.05e-08 0.984 0.325 -3.96e-08 1.19e-07
Population Net Migration Rate 3.45e-08 3.88e-08 0.889 0.374 -4.16e-08 1.11e-07
Population Births, Both Sexes -1.136e-15 4.22e-16 -2.695 0.007 -1.96e-15 -3.1e-16
Population Deaths, Both Sexes 3.021e-15 5.53e-16 5.465 0.000 1.94e-15 4.1e-15
Annual Growth Rate Rate of Natural Increase 0.0012 0.000 5.520 0.000 0.001 0.002
Annual Growth Rate Population Density 0.0036 0.004 0.887 0.375 -0.004 0.012
Annual Growth Rate Total Fertility Rate -0.0019 0.001 -2.594 0.010 -0.003 -0.000
Annual Growth Rate Crude Birth Rate 0.0072 0.002 3.448 0.001 0.003 0.011
Annual Growth Rate Infant Mortality Rate, Both Sexes 0.0011 0.000 8.336 0.000 0.001 0.001
Annual Growth Rate Crude Death Rate -0.0049 0.000 -12.681 0.000 -0.006 -0.004
Annual Growth Rate Net Migration Rate 0.0001 1.93e-05 7.450 0.000 0.000 0.000
Annual Growth Rate Births, Both Sexes 1.81e-05 1.09e-05 1.659 0.097 -3.29e-06 3.95e-05
Annual Growth Rate Deaths, Both Sexes 1.076e-05 5.17e-05 0.208 0.835 -9.06e-05 0.000
Rate of Natural Increase Population Density 0.0013 0.005 0.278 0.781 -0.008 0.011
Rate of Natural Increase Total Fertility Rate -0.0002 0.000 -0.822 0.411 -0.001 0.000
Rate of Natural Increase Crude Birth Rate 0.0066 0.002 3.121 0.002 0.002 0.011
Rate of Natural Increase Infant Mortality Rate, Both Sexes 0.0013 0.000 10.028 0.000 0.001 0.002
Rate of Natural Increase Crude Death Rate -0.0054 0.000 -14.195 0.000 -0.006 -0.005
Rate of Natural Increase Net Migration Rate 4.695e-05 0.000 0.412 0.681 -0.000 0.000
Rate of Natural Increase Births, Both Sexes -1.473e-05 1.23e-05 -1.200 0.230 -3.88e-05 9.33e-06
Rate of Natural Increase Deaths, Both Sexes -6.46e-05 6.21e-05 -1.040 0.299 -0.000 5.72e-05
Population Density Total Fertility Rate 7.866e-06 0.000 0.060 0.952 -0.000 0.000
Population Density Crude Birth Rate -0.0005 0.000 -1.158 0.247 -0.001 0.000
Population Density Infant Mortality Rate, Both Sexes -1.603e-05 9.24e-06 -1.734 0.083 -3.41e-05 2.09e-06
Population Density Crude Death Rate 0.0005 0.000 1.269 0.205 -0.000 0.001
Population Density Net Migration Rate -0.0004 0.000 -0.883 0.377 -0.001 0.000
Population Density Births, Both Sexes 8.738e-10 6.85e-10 1.276 0.202 -4.69e-10 2.22e-09
Population Density Deaths, Both Sexes 8.225e-09 2.24e-09 3.673 0.000 3.83e-09 1.26e-08
Total Fertility Rate Crude Birth Rate -0.0106 0.003 -3.772 0.000 -0.016 -0.005
Total Fertility Rate Infant Mortality Rate, Both Sexes -0.0348 0.004 -8.081 0.000 -0.043 -0.026
Total Fertility Rate Crude Death Rate -0.0083 0.000 -17.996 0.000 -0.009 -0.007
Total Fertility Rate Net Migration Rate -0.0165 0.007 -2.500 0.012 -0.029 -0.004
Total Fertility Rate Births, Both Sexes -2.77e-06 4.42e-07 -6.271 0.000 -3.64e-06 -1.9e-06
Total Fertility Rate Deaths, Both Sexes 2.006e-05 2.21e-06 9.097 0.000 1.57e-05 2.44e-05
Crude Birth Rate Infant Mortality Rate, Both Sexes 0.0152 0.001 22.117 0.000 0.014 0.017
Crude Birth Rate Crude Death Rate -0.0447 0.003 -17.426 0.000 -0.050 -0.040
Crude Birth Rate Net Migration Rate 0.0061 0.001 6.466 0.000 0.004 0.008
Crude Birth Rate Births, Both Sexes -1.064e-07 1.16e-06 -0.092 0.927 -2.38e-06 2.17e-06
Crude Birth Rate Deaths, Both Sexes 3.136e-06 5.15e-06 0.609 0.543 -6.96e-06 1.32e-05
Infant Mortality Rate, Both Sexes Crude Death Rate 0.0031 0.001 3.719 0.000 0.001 0.005
Infant Mortality Rate, Both Sexes Net Migration Rate -0.0017 0.000 -7.573 0.000 -0.002 -0.001
Infant Mortality Rate, Both Sexes Births, Both Sexes 1.532e-08 9.55e-09 1.603 0.109 -3.41e-09 3.4e-08
Infant Mortality Rate, Both Sexes Deaths, Both Sexes -5.667e-08 5.16e-08 -1.097 0.273 -1.58e-07 4.46e-08
Crude Death Rate Net Migration Rate 0.0055 0.001 8.339 0.000 0.004 0.007
Crude Death Rate Births, Both Sexes 2.071e-07 1.59e-06 0.130 0.897 -2.91e-06 3.33e-06
Crude Death Rate Deaths, Both Sexes -6.207e-06 5.49e-06 -1.131 0.258 -1.7e-05 4.55e-06
Net Migration Rate Births, Both Sexes -1.979e-06 1.09e-06 -1.814 0.070 -4.12e-06 1.6e-07
Net Migration Rate Deaths, Both Sexes -1.234e-06 5.17e-06 -0.239 0.811 -1.14e-05 8.9e-06
Births, Both Sexes Deaths, Both Sexes -4.673e-14 6.9e-14 -0.677 0.498 -1.82e-13 8.86e-14
==============================================================================
Omnibus: 104.211 Durbin-Watson: 2.012
Prob(Omnibus): 0.000 Jarque-Bera (JB): 148.524
Skew: -0.299 Prob(JB): 5.60e-33
Kurtosis: 3.778 Cond. No. 1.97e+17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.97e+17. This might indicate that there are
strong multicollinearity or other numerical problems.
In [61]:
# getting the model residuals
residuals = poly_model.resid
# setting the figure size
plt.figure(figsize=(12, 6))
# plotting the residuals
sns.histplot(residuals, bins=50, kde=True, color="blue", edgecolor="black")
# decorating the plot
plt.title("Histogram of Model Residuals", fontsize=16)
plt.xlabel("Residual", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
# Showing the plot
plt.show()
In [62]:
# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs Fitted Plot
resid = poly_model.resid
fitted = poly_model.fittedvalues
lowess = sm.nonparametric.lowess(resid, fitted, frac=0.3)
axes[0, 0].scatter(fitted, resid, alpha=0.7)
axes[0, 0].plot(lowess[:, 0], lowess[:, 1], color='red', linestyle='--', linewidth=1.5)
axes[0, 0].axhline(0, color='black', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# Q-Q Plot
sm.qqplot(poly_model.resid, line='45', fit=True, ax=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
# Scale-Location Plot (Standardized Residuals vs Fitted Values)
standardized_residuals = poly_model.resid_pearson
axes[1, 0].scatter(poly_model.fittedvalues, standardized_residuals, alpha=0.7)
axes[1, 0].axhline(0, color='red', linestyle='--')
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('Standardized Residuals')
axes[1, 0].set_title('Scale-Location Plot')
# Leverage vs Residuals (Cook's Distance)
sm.graphics.influence_plot(poly_model, criterion="cooks", ax=axes[1, 1], size=5)
axes[1, 1].set_title("Leverage vs Residuals")
axes[1, 1].set_xlabel('Fitted Values', size=11)
axes[1, 1].set_ylabel('Standardized Residuals', size=11)
# Adding Cook's Distance lines
n = poly_model.nobs
p = poly_model.df_model + 1
x = np.linspace(0.001, max(poly_model.get_influence().hat_matrix_diag) * 0.9, 100)
for d in [0.5, 1]:
    y = np.sqrt(np.maximum(0, d * (p * (1 - x) / x)))
    axes[1, 1].plot(x, y, color='red', linestyle='--', lw=1)
    axes[1, 1].plot(x, -y, color='red', linestyle='--', lw=1)
# Setting the plot range for cooks distance
# NOTE: I am manually adjusting these limits. Take caution when doing this yourself.
# Start with no limits and then zoom in until you find an appropriate fit.
#axes[1, 1].set_xlim(0, 1)
axes[1, 1].set_ylim(-20, 20)
# Adjust layout to prevent overlapping
plt.tight_layout()
# Show the combined plots
plt.show()
These assumptions for the polynomial model look much better! The Leverage vs Residuals plot has identified some points that are pulling on the model and could be affecting the model's coefficients, but that is not, strictly speaking, an assumption of a linear model. Let's check how the model scores on the testing data.
In [63]:
# Creating the same interaction terms as X_train
X_test_interaction = poly.transform(X_test)
# converting to a dataframe
X_test_interaction_df = pd.DataFrame(X_test_interaction, columns=interaction_feature_names, index=X_test.index)
# generating our predictions
y_pred = poly_model.predict(X_test_interaction_df)
# calculating the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# calculating R^2 (goodness of fit)
r2 = r2_score(y_test, y_pred)
# Plotting actual vs predicted values
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.suptitle('Actual vs Predicted')
plt.title(f'RMSE = {rmse:.3f} || R² = {r2:.3f}')
plt.show()
The degree 2 Polynomial Model scored better than the basic linear model, with 94.6% of the variability explained. The assumptions of the linear model also looked much better, and I would consider them a pass.
Let's keep exploring models! Next, let's remove the multicollinearity from our data and see if we get even more impressive results.
For this linear regression model, we are going to take the same approach as the basic model, but we are going to remove the features that are causing high multicollinearity. We can measure our feature set's multicollinearity by computing each feature's variance inflation factor (VIF). We will iteratively remove the high-VIF features one at a time until all VIFs are under 10, a commonly cited "reasonable" VIF value. You can adjust this threshold as much as you like and test models with different VIF cutoffs, but a VIF of 10 is what we are using here.
For more information on variance inflation factors, check out this article by Data Camp's Vikash Singh linked below.
Variance Inflation Factor (VIF): Addressing Multicollinearity in Regression Analysis (external link)
In [64]:
# Checking multicollinearity to see if it's affecting the high R^2 values
from statsmodels.stats.outliers_influence import variance_inflation_factor
# initializing a dataframe
vif = pd.DataFrame()
# Looping through X features and getting VIF Values.
# X was set at the beginning of Model Engineering section
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["Feature"] = X.columns
print(vif)
              VIF                             Feature
0    82213.776390                               const
1        1.053007                                Year
2       40.086780                          Population
3   492313.820258                  Annual Growth Rate
4    70276.225018            Rate of Natural Increase
5        1.043891                  Population Density
6       30.122885                Total Fertility Rate
7    51713.386889                    Crude Birth Rate
8        6.144328   Infant Mortality Rate, Both Sexes
9     4239.957737                    Crude Death Rate
10  450795.403283                  Net Migration Rate
11       9.376623                  Births, Both Sexes
12      37.637937                  Deaths, Both Sexes
In [65]:
# Creating a function to drop high-VIF features one at a time
def drop_high_vif(X, vif_value=10):
    while True:
        vif = pd.DataFrame()
        vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        vif["Feature"] = X.columns
        max_vif = vif["VIF"].max()
        if max_vif < vif_value:
            # Stop if all VIF values are below the threshold
            break
        # Drop the feature with the highest VIF
        feature_to_drop = vif.loc[vif["VIF"].idxmax(), "Feature"]
        print(f"Dropping feature: {feature_to_drop} with VIF: {max_vif}")
        X = X.drop(columns=[feature_to_drop])
    return X
X_no_mc = drop_high_vif(X)
Dropping feature: Annual Growth Rate with VIF: 492313.8202576999
Dropping feature: const with VIF: 82211.8045721277
Dropping feature: Crude Birth Rate with VIF: 211355.81499620437
Dropping feature: Total Fertility Rate with VIF: 135.9581097188382
Dropping feature: Population with VIF: 42.24213180558822
Dropping feature: Year with VIF: 19.070216948041374
In [66]:
# Printing our new VIF values
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(X_no_mc.values, i) for i in range(X_no_mc.shape[1])]
vif["Feature"] = X_no_mc.columns
print(vif)
        VIF                             Feature
0  4.951165            Rate of Natural Increase
1  1.048942                  Population Density
2  6.417025   Infant Mortality Rate, Both Sexes
3  2.622324                    Crude Death Rate
4  1.001691                  Net Migration Rate
5  9.167904                  Births, Both Sexes
6  8.934938                  Deaths, Both Sexes
In [67]:
# Adding the bias term for capturing the intercept theta
X_no_mc = sm.add_constant(X_no_mc)
y = df["Life Expectancy at Birth, Both Sexes"]
# Using the same split as the previous models with rs 37
X_train_no_mc, X_test_no_mc, y_train_no_mc, y_test_no_mc = train_test_split(
X_no_mc, y, test_size=0.33, random_state=37)
# Training the model
no_mc_model = sm.OLS(y_train_no_mc, X_train_no_mc).fit()
In [68]:
# Printing the model summary
print(no_mc_model.summary())
OLS Regression Results
================================================================================================
Dep. Variable: Life Expectancy at Birth, Both Sexes R-squared: 0.884
Model: OLS Adj. R-squared: 0.884
Method: Least Squares F-statistic: 4031.
Date: Tue, 10 Dec 2024 Prob (F-statistic): 0.00
Time: 20:06:17 Log-Likelihood: -9066.2
No. Observations: 3701 AIC: 1.815e+04
Df Residuals: 3693 BIC: 1.820e+04
Df Model: 7
Covariance Type: nonrobust
=====================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
const 85.2519 0.201 423.770 0.000 84.857 85.646
Rate of Natural Increase -2.4146 0.096 -25.180 0.000 -2.603 -2.227
Population Density 0.0004 2.54e-05 15.734 0.000 0.000 0.000
Infant Mortality Rate, Both Sexes -0.1871 0.004 -41.920 0.000 -0.196 -0.178
Crude Death Rate -0.7183 0.023 -30.569 0.000 -0.764 -0.672
Net Migration Rate 0.0056 0.002 3.408 0.001 0.002 0.009
Births, Both Sexes 1.395e-07 6.42e-08 2.171 0.030 1.35e-08 2.65e-07
Deaths, Both Sexes -4.069e-07 1.41e-07 -2.887 0.004 -6.83e-07 -1.31e-07
==============================================================================
Omnibus: 201.751 Durbin-Watson: 2.016
Prob(Omnibus): 0.000 Jarque-Bera (JB): 375.846
Skew: -0.405 Prob(JB): 2.43e-82
Kurtosis: 4.334 Cond. No. 1.11e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.11e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
In [69]:
# Gathering the residuals
residuals = no_mc_model.resid
# Setting the figure size
plt.figure(figsize=(8, 6))
# Creating the histogram to check for residual normality
sns.histplot(residuals, bins=50, kde=True, color="blue", edgecolor="black")
# decorating the plot
plt.title("Histogram of Model Residuals", fontsize=16)
plt.xlabel("Residual", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
# showing the plot
plt.show()
In [70]:
# creating the grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs Fitted Plot
resid = no_mc_model.resid
fitted = no_mc_model.fittedvalues
lowess = sm.nonparametric.lowess(resid, fitted, frac=0.3)
axes[0, 0].scatter(fitted, resid, alpha=0.7)
axes[0, 0].plot(lowess[:, 0], lowess[:, 1], color='red', linestyle='--', linewidth=1.5)
axes[0, 0].axhline(0, color='black', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# Q-Q Plot
sm.qqplot(no_mc_model.resid, line='45', fit=True, ax=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
# Scale-Location Plot (Standardized Residuals vs Fitted Values)
standardized_residuals = no_mc_model.resid_pearson
axes[1, 0].scatter(no_mc_model.fittedvalues, standardized_residuals, alpha=0.7)
axes[1, 0].axhline(0, color='red', linestyle='--')
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('Standardized Residuals')
axes[1, 0].set_title('Scale-Location Plot')
# Leverage vs Residuals (Cook's Distance)
sm.graphics.influence_plot(no_mc_model, criterion="cooks", ax=axes[1, 1], size=5)
axes[1, 1].set_title("Leverage vs Residuals")
axes[1, 1].set_xlabel('Fitted Values', size=11)
axes[1, 1].set_ylabel('Standardized Residuals', size=11)
# Adding Cook's Distance lines
n = no_mc_model.nobs
p = no_mc_model.df_model + 1
x = np.linspace(0.001, max(no_mc_model.get_influence().hat_matrix_diag) * 0.9, 100)
for d in [0.5, 1]:
    y = np.sqrt(np.maximum(0, d * (p * (1 - x) / x)))
    axes[1, 1].plot(x, y, color='red', linestyle='--', lw=1)
    axes[1, 1].plot(x, -y, color='red', linestyle='--', lw=1)
# Setting the plot range for cooks distance
# NOTE: I am manually adjusting these limits. Take caution when doing this yourself.
# Start with no limits and then zoom in until you find an appropriate fit.
#axes[1, 1].set_xlim(0, 1)
axes[1, 1].set_ylim(-20, 20)
# Adjust layout to prevent overlapping
plt.tight_layout()
# Show the combined plots
plt.show()
In [71]:
# creating the y-hat predictions
y_pred = no_mc_model.predict(X_test_no_mc)
# calculating the RMSE
rmse = np.sqrt(mean_squared_error(y_test_no_mc, y_pred))
# calculating R^2 score
r2 = r2_score(y_test_no_mc, y_pred)
# creating the residual prediction score plot
plt.scatter(y_test_no_mc, y_pred, alpha=0.7)
plt.plot([y_test_no_mc.min(), y_test_no_mc.max()], [y_test_no_mc.min(), y_test_no_mc.max()], '--r', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.suptitle('Actual vs Predicted')
plt.title(f'RMSE = {rmse:.3f} || R² = {r2:.3f}')
plt.show()
Removing multicollinearity from the model resulted in an R-squared score of just 0.885 on the test data, making it the worst performing model on the unseen test data so far. But why exactly was the R-squared lower than the models with multicollinearity? If multicollinearity is problematic, why did the model perform worse on unseen data after removing high VIF features?
Chetna Khanna, in her article 'Multicollinearity — Why is it bad?' on Towards Data Science, provides insight into this question. She explains the following:
"In short, multicollinearity is a problem for causal inference or creates difficulties in causal inference but it is not a problem for prediction or forecasting."
See Chetna Khanna's article linked here:
This means that although multicollinearity may not hurt predictive performance, it makes the individual feature results untrustworthy. These individual feature results are the coefficient estimates and significance values (p-values) that tell us about each feature's effect on the target variable. If the goal of your project is to determine the significance of predictors on the target variable, then handling multicollinearity is essential.
In our case, removing the multicollinearity removed some underlying relationships and effectively decreased our predictive power. Whether to address multicollinearity depends entirely on the goal you are trying to achieve.
Scaling the predictors before feeding them into a model is an effective way to handle differences in feature scales. For example, in our dataset, some values are measured per 1,000 people, while others are raw counts. Scaling the input matrix before passing it to the model is often essential for achieving optimal results. Our results have been strong so far, but let’s see if scaling the data can further enhance our predictive power!
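As a quick illustration of what standardization does (the toy numbers below are assumed purely for the example), `StandardScaler` rescales each column to mean 0 and unit variance via z = (x − mean) / std:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature is a small rate, the other a raw count in the millions
X = np.array([[5.0, 1_000_000.0],
              [7.0, 3_000_000.0],
              [9.0, 5_000_000.0]])

X_scaled = StandardScaler().fit_transform(X)
# After scaling, both columns are on the same footing
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```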
In [72]:
from sklearn.preprocessing import StandardScaler
# Removing the bias so it doesn't get scaled; we still want the bias column to be ones.
X_train_no_const = X_train.drop(columns=["const"])
X_test_no_const = X_test.drop("const", axis=1)
# Scaling the features
scaler_poly = StandardScaler()
X_train_scaled = scaler_poly.fit_transform(X_train_no_const)
X_test_scaled = scaler_poly.transform(X_test_no_const)
# converting the scaled arrays into dataframes
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train_no_const.columns, index=X_train_no_const.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test_no_const.columns, index=X_test_no_const.index)
# adding the constant back in as ones
X_train_scaled = sm.add_constant(X_train_scaled)
X_test_scaled = sm.add_constant(X_test_scaled)
# creating the feature interactions (degree=2 interactions; bias already included in X)
poly_scale = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interaction = poly_scale.fit_transform(X_train_scaled)
# gathering the interaction terms
interaction_feature_names = poly_scale.get_feature_names_out(X_train_scaled.columns)
# creating the interaction dataframe
X_interaction_df = pd.DataFrame(X_interaction, columns=interaction_feature_names, index=X_train_scaled.index)
# Fit the linear regression model with interaction terms
scaled_poly_model = sm.OLS(y_train, X_interaction_df).fit()
# Print the summary of the model
print(scaled_poly_model.summary())
OLS Regression Results
================================================================================================
Dep. Variable: Life Expectancy at Birth, Both Sexes R-squared: 0.959
Model: OLS Adj. R-squared: 0.959
Method: Least Squares F-statistic: 1099.
Date: Tue, 10 Dec 2024 Prob (F-statistic): 0.00
Time: 20:06:28 Log-Likelihood: -7126.0
No. Observations: 3701 AIC: 1.441e+04
Df Residuals: 3622 BIC: 1.490e+04
Df Model: 78
Covariance Type: nonrobust
==============================================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------------------------------
const 70.4073 0.575 122.513 0.000 69.281 71.534
Year 0.1024 0.016 6.542 0.000 0.072 0.133
Population 0.3157 16.499 0.019 0.985 -32.033 32.664
Annual Growth Rate 4.6817 9.097 0.515 0.607 -13.154 22.518
Rate of Natural Increase 2.4518 3.855 0.636 0.525 -5.107 10.010
Population Density 0.8149 0.148 5.491 0.000 0.524 1.106
Total Fertility Rate 2.5215 0.148 17.020 0.000 2.231 2.812
Crude Birth Rate -7.8954 3.497 -2.258 0.024 -14.752 -1.039
Infant Mortality Rate, Both Sexes -2.4827 0.059 -42.438 0.000 -2.597 -2.368
Crude Death Rate -0.6585 1.011 -0.651 0.515 -2.641 1.324
Net Migration Rate -4.3661 8.588 -0.508 0.611 -21.204 12.472
Births, Both Sexes -2.2625 15.055 -0.150 0.881 -31.780 27.255
Deaths, Both Sexes 2.9174 8.085 0.361 0.718 -12.933 18.768
const Year 0.1024 0.016 6.542 0.000 0.072 0.133
const Population 0.3157 16.499 0.019 0.985 -32.033 32.664
const Annual Growth Rate 4.6817 9.097 0.515 0.607 -13.154 22.518
const Rate of Natural Increase 2.4518 3.855 0.636 0.525 -5.107 10.010
const Population Density 0.8149 0.148 5.491 0.000 0.524 1.106
const Total Fertility Rate 2.5215 0.148 17.020 0.000 2.231 2.812
const Crude Birth Rate -7.8954 3.497 -2.258 0.024 -14.752 -1.039
const Infant Mortality Rate, Both Sexes -2.4827 0.059 -42.438 0.000 -2.597 -2.368
const Crude Death Rate -0.6585 1.011 -0.651 0.515 -2.641 1.324
const Net Migration Rate -4.3661 8.588 -0.508 0.611 -21.204 12.472
const Births, Both Sexes -2.2625 15.055 -0.150 0.881 -31.780 27.255
const Deaths, Both Sexes 2.9174 8.085 0.361 0.718 -12.933 18.768
Year Population 0.3619 0.353 1.026 0.305 -0.330 1.054
Year Annual Growth Rate 17.6553 18.202 0.970 0.332 -18.031 53.342
Year Rate of Natural Increase -12.7114 7.494 -1.696 0.090 -27.404 1.981
Year Population Density -0.0490 0.031 -1.569 0.117 -0.110 0.012
Year Total Fertility Rate 0.7100 0.180 3.941 0.000 0.357 1.063
Year Crude Birth Rate 5.3829 6.587 0.817 0.414 -7.532 18.298
Year Infant Mortality Rate, Both Sexes 0.0663 0.075 0.880 0.379 -0.081 0.214
Year Crude Death Rate -2.0028 1.903 -1.052 0.293 -5.734 1.728
Year Net Migration Rate -16.0713 17.185 -0.935 0.350 -49.764 17.622
Year Births, Both Sexes 0.6389 0.180 3.540 0.000 0.285 0.993
Year Deaths, Both Sexes -1.3450 0.368 -3.653 0.000 -2.067 -0.623
Population Annual Growth Rate -140.7166 149.045 -0.944 0.345 -432.936 151.503
Population Rate of Natural Increase 106.4075 68.955 1.543 0.123 -28.787 241.602
Population Population Density -6.4468 4.498 -1.433 0.152 -15.266 2.372
Population Total Fertility Rate -1.5494 3.178 -0.488 0.626 -7.780 4.681
Population Crude Birth Rate -45.3525 54.364 -0.834 0.404 -151.939 61.234
Population Infant Mortality Rate, Both Sexes -0.4215 1.199 -0.352 0.725 -2.773 1.930
Population Crude Death Rate 11.6350 15.615 0.745 0.456 -18.981 42.251
Population Net Migration Rate 144.6544 140.684 1.028 0.304 -131.174 420.483
Population Births, Both Sexes -0.1416 0.107 -1.329 0.184 -0.350 0.067
Population Deaths, Both Sexes 0.2303 0.064 3.619 0.000 0.106 0.355
Annual Growth Rate Rate of Natural Increase 9.4318 51.927 0.182 0.856 -92.377 111.241
Annual Growth Rate Population Density 4.2251 18.376 0.230 0.818 -31.804 40.254
Annual Growth Rate Total Fertility Rate 137.5488 91.467 1.504 0.133 -41.783 316.881
Annual Growth Rate Crude Birth Rate -137.9312 108.886 -1.267 0.205 -351.414 75.552
Annual Growth Rate Infant Mortality Rate, Both Sexes -4.7653 42.839 -0.111 0.911 -88.757 79.226
Annual Growth Rate Crude Death Rate -25.9796 26.253 -0.990 0.322 -77.451 25.492
Annual Growth Rate Net Migration Rate 0.0056 0.002 2.591 0.010 0.001 0.010
Annual Growth Rate Births, Both Sexes 54.4505 68.839 0.791 0.429 -80.516 189.417
Annual Growth Rate Deaths, Both Sexes 116.7816 152.928 0.764 0.445 -183.052 416.615
Rate of Natural Increase Population Density 3.9634 7.464 0.531 0.595 -10.671 18.598
Rate of Natural Increase Total Fertility Rate -62.0856 35.734 -1.737 0.082 -132.147 7.976
Rate of Natural Increase Crude Birth Rate 43.3773 37.518 1.156 0.248 -30.181 116.936
Rate of Natural Increase Infant Mortality Rate, Both Sexes 4.1362 17.219 0.240 0.810 -29.623 37.896
Rate of Natural Increase Crude Death Rate 12.0358 8.787 1.370 0.171 -5.192 29.264
Rate of Natural Increase Net Migration Rate -2.4755 50.186 -0.049 0.961 -100.872 95.921
Rate of Natural Increase Births, Both Sexes -4.9567 27.586 -0.180 0.857 -59.043 49.129
Rate of Natural Increase Deaths, Both Sexes -115.9372 65.455 -1.771 0.077 -244.270 12.395
Population Density Total Fertility Rate 0.2901 0.323 0.899 0.369 -0.342 0.923
Population Density Crude Birth Rate -6.0891 6.795 -0.896 0.370 -19.411 7.233
Population Density Infant Mortality Rate, Both Sexes 0.0063 0.389 0.016 0.987 -0.757 0.770
Population Density Crude Death Rate 1.5948 1.944 0.820 0.412 -2.217 5.407
Population Density Net Migration Rate -4.0735 17.369 -0.235 0.815 -38.128 29.981
Population Density Births, Both Sexes -3.2290 2.437 -1.325 0.185 -8.007 1.549
Population Density Deaths, Both Sexes 15.9914 3.666 4.362 0.000 8.804 23.179
Total Fertility Rate Crude Birth Rate 15.5618 22.000 0.707 0.479 -27.573 58.696
Total Fertility Rate Infant Mortality Rate, Both Sexes -6.9434 0.388 -17.916 0.000 -7.703 -6.184
Total Fertility Rate Crude Death Rate 1.1842 6.346 0.187 0.852 -11.259 13.627
Total Fertility Rate Net Migration Rate -129.9348 86.366 -1.504 0.133 -299.266 39.397
Total Fertility Rate Births, Both Sexes -4.5421 1.335 -3.401 0.001 -7.160 -1.924
Total Fertility Rate Deaths, Both Sexes 7.7448 3.011 2.572 0.010 1.841 13.648
Crude Birth Rate Infant Mortality Rate, Both Sexes 6.5869 12.812 0.514 0.607 -18.533 31.707
Crude Birth Rate Crude Death Rate -9.0278 0.326 -27.725 0.000 -9.666 -8.389
Crude Birth Rate Net Migration Rate 124.9555 102.637 1.217 0.224 -76.276 326.187
Crude Birth Rate Births, Both Sexes -12.7938 24.455 -0.523 0.601 -60.741 35.153
Crude Birth Rate Deaths, Both Sexes 63.9588 51.961 1.231 0.218 -37.916 165.834
Infant Mortality Rate, Both Sexes Crude Death Rate 1.7077 3.693 0.462 0.644 -5.532 8.948
Infant Mortality Rate, Both Sexes Net Migration Rate 3.7087 40.443 0.092 0.927 -75.584 83.001
Infant Mortality Rate, Both Sexes Births, Both Sexes 2.5990 0.465 5.584 0.000 1.686 3.512
Infant Mortality Rate, Both Sexes Deaths, Both Sexes -3.0433 1.144 -2.660 0.008 -5.286 -0.800
Crude Death Rate Net Migration Rate 26.7254 25.207 1.060 0.289 -22.697 76.148
Crude Death Rate Births, Both Sexes 6.8841 9.677 0.711 0.477 -12.089 25.857
Crude Death Rate Deaths, Both Sexes -21.5416 15.993 -1.347 0.178 -52.898 9.815
Net Migration Rate Births, Both Sexes -59.7211 64.966 -0.919 0.358 -187.095 67.653
Net Migration Rate Deaths, Both Sexes -115.8706 144.387 -0.802 0.422 -398.959 167.218
Births, Both Sexes Deaths, Both Sexes -0.0949 0.129 -0.737 0.461 -0.347 0.158
==============================================================================
Omnibus: 63.758 Durbin-Watson: 2.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 123.933
Skew: -0.050 Prob(JB): 1.23e-27
Kurtosis: 3.891 Cond. No. 2.86e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.73e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [73]:
# gathering the residuals
residuals = scaled_poly_model.resid
# Setting the plot size
plt.figure(figsize=(8, 6))
# Creating and showing the plot
sns.histplot(residuals, bins=50, kde=True, color="blue", edgecolor="black")
plt.title("Histogram of Residuals", fontsize=16)
plt.xlabel("Residuals", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.show()
In [74]:
# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs Fitted Plot
resid = scaled_poly_model.resid
fitted = scaled_poly_model.fittedvalues
lowess = sm.nonparametric.lowess(resid, fitted, frac=0.3)
axes[0, 0].scatter(fitted, resid, alpha=0.7)
axes[0, 0].plot(lowess[:, 0], lowess[:, 1], color='red', linestyle='--', linewidth=1.5)
axes[0, 0].axhline(0, color='black', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# Q-Q Plot
sm.qqplot(scaled_poly_model.resid, line='45', fit=True, ax=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
# Scale-Location Plot (Standardized Residuals vs Fitted Values)
standardized_residuals = scaled_poly_model.resid_pearson
axes[1, 0].scatter(scaled_poly_model.fittedvalues, standardized_residuals, alpha=0.7)
axes[1, 0].axhline(0, color='red', linestyle='--')
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('Standardized Residuals')
axes[1, 0].set_title('Scale-Location Plot')
# Leverage vs Residuals (Cook's Distance)
sm.graphics.influence_plot(scaled_poly_model, criterion="cooks", ax=axes[1, 1], size=5)
axes[1, 1].set_title("Leverage vs Residuals")
axes[1, 1].set_xlabel('Fitted Values', size=11)
axes[1, 1].set_ylabel('Standardized Residuals', size=11)
# Adding Cook's Distance lines
n = scaled_poly_model.nobs
p = scaled_poly_model.df_model + 1
x = np.linspace(0.001, max(scaled_poly_model.get_influence().hat_matrix_diag) * 0.9, 100)
for d in [0.5, 1]:
y = np.sqrt(np.maximum(0, d * (p * (1 - x) / x)))
axes[1, 1].plot(x, y, color='red', linestyle='--', lw=1)
axes[1, 1].plot(x, -y, color='red', linestyle='--', lw=1)
# Setting the plot range for cooks distance
# NOTE: I am manually adjusting these limits. Take caution when doing this yourself.
# Start with no limits and then zoom in until you find an appropriate fit.
#axes[1, 1].set_xlim(0, 1)
axes[1, 1].set_ylim(-20, 20)
# Adjust layout to prevent overlapping
plt.tight_layout()
# Show the combined plots
plt.show()
In [75]:
# Apply the same polynomial interaction transformation to the test set
X_test_interaction = poly_scale.transform(X_test_scaled)
interaction_feature_names = poly_scale.get_feature_names_out(X_test_scaled.columns)
# Convert back to DataFrame with the same feature names
X_test_interaction_df = pd.DataFrame(X_test_interaction, columns=interaction_feature_names, index=X_test_scaled.index)
# Make predictions on the test set
y_pred = scaled_poly_model.predict(X_test_interaction_df)
# Calculate the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Calculate R^2
r2 = r2_score(y_test, y_pred)
# Plot Actual vs Predicted
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.suptitle('Actual vs Predicted')
plt.title(f'RMSE = {rmse:.3f} || R² = {r2:.3f}')
plt.show()
Scaling the predictors produced the most powerful predictive model yet, explaining 95.2% of the variability in the target variable, while also holding up well against the assumptions of a linear model.
The methods we’ve used so far to create models involve fitting a single training set and testing it on a single testing set. While this approach isn’t necessarily flawed, it doesn’t always provide the most reliable indication of how a model will perform on unseen data. To obtain a better measure of model performance, we can use cross-validation, which provides a more accurate estimate of our predictive power. Let's use the same model we just created, but use cross-validation to get a better approximation of our predictive power.
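For intuition, scikit-learn also ships a one-line helper, `cross_val_score`, that runs the same fold-and-score loop for you. Here is a minimal sketch on synthetic data (the dataset and plain linear model are illustrative assumptions, not our life-expectancy pipeline):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 10-fold CV: fit on 9 folds, score on the held-out fold, repeat 10 times
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print(len(scores))    # one R² per fold
print(scores.mean())  # averaged estimate of out-of-sample performance
```

Because our pipeline scales and expands features inside each fold (to avoid leaking information from the validation fold), we build a custom loop below instead of using this helper directly.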
For a more in-depth explanation of cross-validation and its Python implementation, see this article hosted on GeeksforGeeks:
The code chunk below defines a custom cross-validation function, "cv_scores", designed to evaluate the performance of our scaled polynomial regression model with interaction terms.
In [76]:
# Importing necessary libraries
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
# Creating a function to implement cv on scaled data for a polynomial linear model of degree 2
def cv_scores(X, y, degree=2, folds=10):
    kf = KFold(n_splits=folds, shuffle=True, random_state=3)
    # initializing empty lists for scores
    rmse_scores = []
    r2_scores = []
    # looping over the k-folds
    for train_index, val_index in kf.split(X):
        # Gathering the training data based on the current fold's indices
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        # Removing the bias so it doesn't get scaled; we still want the bias column to be ones.
        X_train_no_const = X_train.drop(columns=["const"])
        X_val_no_const = X_val.drop("const", axis=1)
        # scaling the current fold
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train_no_const)
        X_val_scaled = scaler.transform(X_val_no_const)
        # converting the scaled data back to dataframes
        X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train_no_const.columns, index=X_train_no_const.index)
        X_val_scaled = pd.DataFrame(X_val_scaled, columns=X_val_no_const.columns, index=X_val_no_const.index)
        # adding the bias term back
        X_train_with_const = sm.add_constant(X_train_scaled)
        X_val_with_const = sm.add_constant(X_val_scaled)
        # Creating the interactions
        poly = PolynomialFeatures(degree=degree, interaction_only=True, include_bias=False)
        X_train_poly = poly.fit_transform(X_train_with_const)
        X_val_poly = poly.transform(X_val_with_const)
        # converting the interactions back into dataframes
        interaction_feature_names = poly.get_feature_names_out(X_train_with_const.columns)
        X_train_poly_df = pd.DataFrame(X_train_poly, columns=interaction_feature_names, index=X_train_with_const.index)
        X_val_poly_df = pd.DataFrame(X_val_poly, columns=interaction_feature_names, index=X_val_with_const.index)
        # fitting the model
        cv_poly_scaled_model = sm.OLS(y_train, X_train_poly_df).fit()
        # making the predictions
        y_val_pred = cv_poly_scaled_model.predict(X_val_poly_df)
        # Calculating the RMSE and R^2 for the current fold and adding them to the score lists
        rmse_scores.append(np.sqrt(mean_squared_error(y_val, y_val_pred)))
        r2_scores.append(r2_score(y_val, y_val_pred))
    # Returning the mean RMSE and R^2 across folds
    return np.mean(rmse_scores), np.mean(r2_scores)
# Using our cv function to get the mean rmse and r^2 from our 10 folds
rmse, r2 = cv_scores(X_train, y_train, degree=2, folds=10)
print(f"Cross-Validated RMSE: {rmse:.3f}")
print(f"Cross-Validated R²: {r2:.3f}")
Cross-Validated RMSE: 1.981
Cross-Validated R²: 0.931
The scores from cross-validation are slightly lower, as expected when using this method. Cross-validation metrics provide more reliable estimates of model accuracy on unseen data, making them a trustworthy benchmark. With that being said, achieving an R² of 0.931, or 93.1% explained variability is fantastic and indicates that this is a strong model.
Ridge and Lasso Regression are extensions of Ordinary Least Squares (OLS) that incorporate a regularization term to enhance model performance. Ridge Regression helps control the impact of outliers by shrinking large coefficients, while Lasso not only regularizes but also performs feature selection by shrinking some coefficients to zero. These techniques refine linear models and often lead to more robust and reliable results.
Ridge Regression:
- Shrinks large coefficients by applying an L2 regularization penalty (the sum of squared coefficients).
- All features remain in the model; no coefficients are reduced to zero.
- Helps prevent overfitting, especially when features are correlated or have large magnitudes.
Lasso Regression:
- Uses an L1 regularization penalty (the sum of the absolute values of coefficients).
- Can shrink some coefficients to exactly zero, effectively removing irrelevant features from the model.
- Performs feature selection in addition to regularization, making it ideal for high-dimensional datasets with many irrelevant predictors.
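In equation form, the two penalties look like this (the standard textbook objectives, with alpha as the regularization strength; note that scikit-learn additionally rescales the squared-error term for Lasso):

```latex
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \sum_{j=1}^{p} \beta_j^2
\qquad
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The squared L2 penalty shrinks coefficients smoothly toward zero but never exactly to zero, while the L1 penalty's sharp corner at zero is what allows Lasso to zero out coefficients entirely.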
This article from Aarshay Jain on Analytics Vidhya provides an excellent explanation and walk-through of these two methods:
In [77]:
# Importing ridge
from sklearn.linear_model import Ridge
# Removing the bias so it doesn't get scaled; we still want the bias column to be ones.
X_train_no_const = X_train.drop(columns=["const"])
X_test_no_const = X_test.drop("const", axis=1)
# Ridge regression works best with scaled data
scaler_rl = StandardScaler()
X_train_scaled = scaler_rl.fit_transform(X_train_no_const)
X_test_scaled = scaler_rl.transform(X_test_no_const)
# adding the bias column back in after scaling
X_train_scaled = sm.add_constant(X_train_scaled)
X_test_scaled = sm.add_constant(X_test_scaled)
# fitting Ridge regression with a random alpha
ridge = Ridge(alpha=10.0, solver="auto")
ridge.fit(X_train_scaled, y_train)
# making predictions
y_test_pred = ridge.predict(X_test_scaled)
# evaluating the model
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2 = r2_score(y_test, y_test_pred)
# printing the scores
print(f"Test RMSE: {rmse:.3f}")
print(f"Test R²: {r2:.3f}")
Test RMSE: 2.416
Test R²: 0.914
An R² of 0.914 isn't the best score we've seen, but notice that in the code above we simply picked an arbitrary alpha value for our regularization term. There are more principled ways to choose alpha, as demonstrated in the following code chunk using Grid Search.
In [78]:
#importing grid search
from sklearn.model_selection import GridSearchCV
# Creating a list of alpha values to check in our grid search
param_grid = {'alpha': np.arange(0.1, 100.5, 0.1)}
# Searching for the best alpha value
ridge_cv = GridSearchCV(Ridge(), param_grid, cv=10, scoring='neg_mean_squared_error')
ridge_cv.fit(X_train_scaled, y_train)
# Printing the best alpha and score
print(f"Best alpha: {ridge_cv.best_params_['alpha']}")
print(f"Best RMSE score: {round(np.sqrt(-ridge_cv.best_score_),3)}")
Best alpha: 0.5
Best RMSE score: 2.395
In [79]:
# Fit Ridge regression with the optimal alpha
ridge = Ridge(alpha=0.5, solver="auto")
ridge.fit(X_train_scaled, y_train)
# predicting y-hat
y_test_pred = ridge.predict(X_test_scaled)
# Checking the models scores
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2 = r2_score(y_test, y_test_pred)
print(f"Test RMSE: {rmse:.3f}")
print(f"Test R²: {r2:.3f}")
Test RMSE: 2.408
Test R²: 0.915
In [80]:
# Checking the feature importance decided by Ridge Regression
feature_importance = pd.DataFrame({
'Feature': X_train.columns,
'Coefficient': ridge.coef_
}).sort_values(by='Coefficient', ascending=False)
print(feature_importance.reset_index(drop=True))
    Feature                              Coefficient
0   Total Fertility Rate                   7.492922
1   Net Migration Rate                     1.006639
2   Population                             0.701532
3   Population Density                     0.645700
4   Year                                   0.389827
5   Births, Both Sexes                     0.133352
6   const                                  0.000000
7   Deaths, Both Sexes                    -0.931505
8   Annual Growth Rate                    -1.017386
9   Crude Death Rate                      -3.444583
10  Infant Mortality Rate, Both Sexes     -3.733471
11  Rate of Natural Increase              -4.295023
12  Crude Birth Rate                      -5.921974
Using Grid Search to find the optimal alpha value of 0.5 only improved our model slightly, to an R² of 0.915. The above code chunk also shows the feature importance determined by the Ridge model.
Let's give Lasso a try on our scaled data and see if we get better results.
In [81]:
# importing lasso
from sklearn.linear_model import Lasso
# Creating a list of alpha values to check in our grid search
param_grid = {'alpha': np.arange(0.1, 100.5, 0.1)}
# Finding the best alpha for Lasso
lasso_cv = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5, scoring='neg_mean_squared_error')
# Fitting the lasso model (note the scaled data)
lasso_cv.fit(X_train_scaled, y_train)
# printing the best alpha
best_alpha = lasso_cv.best_params_['alpha']
print(f"Best alpha for Lasso: {best_alpha}")
# Best CV score (convert from negative MSE to RMSE)
best_cv_rmse = (-lasso_cv.best_score_) ** 0.5
print(f"Best Cross-Validated RMSE: {best_cv_rmse:.3f}")
Best alpha for Lasso: 0.1
Best Cross-Validated RMSE: 2.640
In [82]:
# Fit Lasso with the best alpha
lasso = Lasso(alpha=best_alpha, max_iter=10000)
lasso.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_test_pred = lasso.predict(X_test_scaled)
# Evaluate RMSE and R² on test data
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_r2 = r2_score(y_test, y_test_pred)
print(f"Test RMSE: {test_rmse:.3f}")
print(f"Test R²: {test_r2:.3f}")
Test RMSE: 2.652
Test R²: 0.897
In [83]:
# Feature importance (non-zero coefficients)
lasso_coefficients = pd.DataFrame({
'Feature': X_train.columns,
'Coefficient': lasso.coef_
}).sort_values(by='Coefficient', ascending=False)
# Display non-zero coefficients
non_zero_features = lasso_coefficients[lasso_coefficients['Coefficient'] != 0]
print(f"Number of features selected by Lasso: {len(non_zero_features)}")
print(non_zero_features.reset_index(drop=True))
Number of features selected by Lasso: 8
Feature Coefficient
0 Total Fertility Rate 1.617017
1 Population Density 0.654702
2 Year 0.404901
3 Net Migration Rate 0.045656
4 Deaths, Both Sexes -0.023996
5 Crude Death Rate -1.561320
6 Crude Birth Rate -4.196048
7 Infant Mortality Rate, Both Sexes -4.402785
The Lasso model performed worse than Ridge regression and most of our linear models without regularization, achieving an R² value of 0.897 on the test data. With an L1 regularization alpha value of 0.1, the Lasso model selected 8 out of the 13 features, ranking them as shown above.
A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.
Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that works by creating a multitude of decision trees during training.
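Before jumping to the full forest, here is a minimal single-tree sketch on toy data (the data and depth are assumptions for illustration): a regression tree repeatedly splits the feature space and predicts the mean target value within each leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noiseless quadratic curve
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = X.ravel() ** 2

# A shallow tree approximates the curve with a step function (one value per leaf)
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

# Predictions are piecewise constant: nearby inputs can map to the same leaf value
print(tree.predict([[1.0]]), tree.predict([[9.0]]))
```

A random forest trains many such trees on bootstrapped samples and random feature subsets, then averages their predictions, which smooths out the individual trees' step-function quirks.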
The Random Forest model is one of the most powerful tools available to a data scientist. It can handle diverse types of input data, requires minimal feature engineering, and often delivers excellent results. However, one of its key drawbacks is its complexity. On small datasets like ours, Random Forest performs efficiently, but with millions of data points, the model can become overly complex and resource intensive, offering diminishing returns. For projects like ours, however, it is an excellent choice for unlocking strong predictive power. Let’s see a Random Forest model in action.
In [84]:
# importing random forest from sklearn
from sklearn.ensemble import RandomForestRegressor
X_train_no_const = X_train.drop("const",axis = 1)
X_test_no_const = X_test.drop("const",axis = 1)
# creating the random forest object with set hyperparameters
rf = RandomForestRegressor(n_estimators=100, random_state=37, max_depth=None)
# fitting the random forest model with the training data
rf.fit(X_train_no_const, y_train)
# making predictions for both the training and testing data
y_train_pred = rf.predict(X_train_no_const)
y_test_pred = rf.predict(X_test_no_const)
# evaluating the model for the training data
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_r2 = r2_score(y_train, y_train_pred)
# evaluating the model for the testing data
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_r2 = r2_score(y_test, y_test_pred)
# printing the results
print(f"Training RMSE: {train_rmse:.3f}")
print(f"Training R²: {train_r2:.3f}")
print(f"Testing RMSE: {test_rmse:.5f}")
print(f"Testing R²: {test_r2:.5f}")
Training RMSE: 0.296
Training R²: 0.999
Testing RMSE: 0.79448
Testing R²: 0.99075
Those are some very impressive scores, especially with minimal data preparation! 99% of the variability in the target variable explained, even on the testing data! Let’s apply the same Grid Search method we used for Ridge and Lasso to tune the hyperparameters and see if we can achieve even better results.
In [85]:
import time
start_time = time.time()
# creating the parameter grid to use in grid search
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]}
# initializing the grid search with the above parameters
rf_grid = GridSearchCV(
RandomForestRegressor(random_state=37),
param_grid,
cv=10,
scoring='neg_mean_squared_error',
n_jobs=-1)
# fitting the grid search model
rf_grid.fit(X_train_no_const, y_train)
# printing the best hyper parameters
print("Best parameters:", rf_grid.best_params_)
# printing the best RMSE score
best_cv_rmse = (-rf_grid.best_score_) ** 0.5
print(f"Best Cross-Validated RMSE: {best_cv_rmse:.3f}")
# fitting the model with the best parameters
best_rf = rf_grid.best_estimator_
# making predictions using the best model
y_test_pred = best_rf.predict(X_test_no_const)
# calculating the scores
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_r2 = r2_score(y_test, y_test_pred)
# Minimizing MSE at each split
print(f"Criterion used: {best_rf.criterion}")
# printing the best scores
print(f"Test RMSE: {test_rmse:.5f}")
print(f"Test R²: {test_r2:.5f}")
# adding some space
print()
print()
# Printing how long it took to complete
end_time = time.time()
print(f"Time taken: {end_time - start_time:.5f} seconds")
Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best Cross-Validated RMSE: 0.808
Criterion used: squared_error
Test RMSE: 0.79448
Test R²: 0.99075
Time taken: 257.62408 seconds
Searching for better parameters didn't improve the test set's R² value, but it's tough to get a better score than 0.99075. We can check the feature importance of the Random Forest model, as shown below. Infant Mortality Rate was the most important feature, followed by Crude Death Rate. Infant Mortality makes sense as the most important feature since it was 92% negatively correlated to life expectancy. Crude death rate had a negative 45% correlation to life expectancy.
In [86]:
# RF feature importance
importances = best_rf.feature_importances_
features = X_train_no_const.columns
# sorting by importance
indices = np.argsort(importances)[::-1]
# creating wrapped x labels for readability
wrapped_labels = [textwrap.fill(features[i], width=15, break_long_words=False) for i in indices]
# plotting the feature importance
plt.figure(figsize=(12, 6))
plt.bar(range(X_train_no_const.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train_no_const.shape[1]), wrapped_labels, rotation=90)
plt.title("Feature Importances")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()
We can't easily determine what is going on inside our Random Forest model because it is an ensemble of decision trees. However, we can take a closer look at the very first decision tree in the ensemble.
In [87]:
from sklearn.tree import export_text
# extracting the very first tree in the ensemble
tree = best_rf.estimators_[0]
# Getting a text based output of the tree with a depth of 2
tree_text = export_text(tree, feature_names=list(X_train_no_const.columns), max_depth=2)
print(tree_text)
|--- Infant Mortality Rate, Both Sexes <= 37.95
|   |--- Infant Mortality Rate, Both Sexes <= 8.65
|   |   |--- Crude Death Rate <= 12.05
|   |   |   |--- truncated branch of depth 22
|   |   |--- Crude Death Rate >  12.05
|   |   |   |--- truncated branch of depth 14
|   |--- Infant Mortality Rate, Both Sexes >  8.65
|   |   |--- Infant Mortality Rate, Both Sexes <= 26.65
|   |   |   |--- truncated branch of depth 20
|   |   |--- Infant Mortality Rate, Both Sexes >  26.65
|   |   |   |--- truncated branch of depth 18
|--- Infant Mortality Rate, Both Sexes >  37.95
|   |--- Crude Death Rate <= 10.95
|   |   |--- Crude Death Rate <= 7.85
|   |   |   |--- truncated branch of depth 14
|   |   |--- Crude Death Rate >  7.85
|   |   |   |--- truncated branch of depth 18
|   |--- Crude Death Rate >  10.95
|   |   |--- Crude Death Rate <= 13.85
|   |   |   |--- truncated branch of depth 12
|   |   |--- Crude Death Rate >  13.85
|   |   |   |--- truncated branch of depth 14
We can see that the very first split in the decision tree is based on Infant Mortality Rate. If the infant mortality rate is at most 37.95, the next two major splits are also on Infant Mortality Rate. If at the first split the infant mortality rate is greater than 37.95, then the next two splits are on Crude Death Rate.
We can even go one step further and visualize the entire decision tree. I have created the code to do so below, so if you would like to see the entire tree, click the link after the following code chunk.
In [88]:
# To complete this yourself, you will need to install graphviz:
# https://graphviz.org/download/. Download and install the executable,
# select "add to PATH for all users" during installation, then restart the kernel and run this code.
from sklearn.tree import export_graphviz
import graphviz
# Export the tree to DOT format
dot_data = export_graphviz(
tree,
out_file=None,
feature_names=X_train_no_const.columns,
filled=True,
rounded=True)
# save the data to the graph object
graph = graphviz.Source(dot_data)
# Save as SVG for web
graph.render("decision_tree", format="svg")
Out[88]:
'decision_tree.svg'
Check out the entire first decision tree in the RF ensemble. Be prepared to zoom out!
The final model we will build is a Long Short-Term Memory (LSTM) Neural Network. Neural networks are incredibly powerful tools that have revolutionized many industries over the past decade. An LSTM is a specialized type of Recurrent Neural Network (RNN) designed to excel at modeling time-series data and sequential patterns.
Building an LSTM model is not much more complicated than building our previous models; the process is quite similar. Take a moment to review the code chunk below to familiarize yourself with the structure and design of an LSTM model.
Check out the Youtube video from StatQuest's Josh Starmer for a detailed explanation.
In [89]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Input
# Setting a seed for reproducibility
np.random.seed(37)
df_sorted = df.sort_values(by="Year").reset_index(drop=True)
# Removing the categorical data and target variable from our input matrix
# (note: df, not df_sorted, is used below, so the shuffle=False split follows the original row order)
X_nn = df.drop(["Life Expectancy at Birth, Both Sexes",'GENC','Name'], axis=1)
# creating our target feature variable
y_nn = df["Life Expectancy at Birth, Both Sexes"]
# splitting the data for test
X_train, X_test, y_train, y_test = train_test_split(
X_nn, y_nn,
test_size=.33,
random_state=37,
shuffle=False)
# Initialize the scaler for features
scaler_X_nn = StandardScaler()
# Fit on training data and transform
X_train_scaled = scaler_X_nn.fit_transform(X_train)
# Transform test data
X_test_scaled = scaler_X_nn.transform(X_test)
#Reshape Data for LSTM
timesteps = 1 # Single timestep since no sequence data
n_features = X_train_scaled.shape[1] # Number of features
# Reshaping the input matrix
X_train_scaled = X_train_scaled.reshape((X_train_scaled.shape[0], timesteps, n_features))
X_test_scaled = X_test_scaled.reshape((X_test_scaled.shape[0], timesteps, n_features))
# building the model
nn_model = Sequential([
Input(shape=(timesteps, n_features)), # Explicit Input layer
LSTM(50, activation='relu'),
Dense(1)],name="Life_Expectancy_Model")
nn_model.compile(optimizer='adam', loss='mse')
nn_model.summary()
# Training the model with 1000 epochs. Feel free to set this to any number you wish.
# The larger the number, the longer training will take. Set verbose to 1 below
# to see progress.
epochs = 1000
batch_size = 16
# Fitting the model. Set the verbose to 1 when running on your own.
history = nn_model.fit(
X_train_scaled,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_split=0.1,
verbose=0)
# making predictions
y_pred = nn_model.predict(X_test_scaled).flatten()
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2): {r2:.2f}")
y_test_aligned = pd.Series(y_test).reset_index(drop=True)
y_pred_aligned = pd.Series(y_pred).reset_index(drop=True)
plt.figure(figsize=(12, 6))
plt.plot(y_test_aligned, label='Actual Life Expectancy', marker='o', linestyle='None')
plt.plot(y_pred_aligned, label='Predicted Life Expectancy', marker='x', linestyle='None')
plt.title('Actual vs Predicted Life Expectancy at Birth, Both Sexes')
plt.xlabel('Sample Index')
plt.ylabel('Life Expectancy at Birth, Both Sexes')
plt.legend()
plt.grid(True)
plt.show()
Model: "Life_Expectancy_Model"
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Layer (type)            ┃ Output Shape     ┃  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ lstm (LSTM)             │ (None, 50)       │   12,600 │
├─────────────────────────┼──────────────────┼──────────┤
│ dense (Dense)           │ (None, 1)        │       51 │
└─────────────────────────┴──────────────────┴──────────┘
Total params: 12,651 (49.42 KB)
Trainable params: 12,651 (49.42 KB)
Non-trainable params: 0 (0.00 B)
57/57 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Mean Squared Error (MSE): 5.56
R-squared (R2): 0.88
The LSTM model achieved an R² score of 0.88, not the best of our models, but still a very good score!
A major part of the data science pipeline is being able to take your data and turn it into actionable future predictions. These predictions will never be 100% accurate, but that is why in the model engineering phase we searched for models that performed well on the test data and had high cross-validation scores. Models that received high R² scores on the test data, and especially in cross-validation, should generalize better to new, unseen data. The challenge with studying real-world topics like human life expectancy, however, is that we are attempting to predict outcomes using data that has yet to be generated. By the time we have the actual data to feed into our model, the real-world results for life expectancy will already be known. There really is only one solution to this lack of data: we must generate our own data.
There are various methods for generating new data to feed into a model, and the process can quickly become complex. If generating accurate future data were easy, many of the world’s challenges would already be solved. One common approach for generating future data in a time-dependent context is called Time Series Forecasting. While powerful, it often involves intricate techniques and assumptions. For the sake of simplicity in this tutorial, we’ll take a slightly different approach. If you’re interested in exploring Time Series Forecasting further, check out the resource from Train in Data linked below.
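To get a flavor of what time series forecasting looks like before we take our simpler route, here is a deliberately minimal AR(1) sketch on synthetic data. This is only a toy illustration on a hypothetical series, not the approach used in this tutorial and not a substitute for proper forecasting tools:

```python
import numpy as np

# Hypothetical upward-trending series standing in for a feature over time
rng = np.random.default_rng(37)
series = np.cumsum(rng.normal(0.5, 1.0, 50)) + 70

# Fit y_t ~ c + phi * y_{t-1} by least squares (an AR(1) model)
y_prev, y_next = series[:-1], series[1:]
A = np.column_stack([np.ones_like(y_prev), y_prev])
c, phi = np.linalg.lstsq(A, y_next, rcond=None)[0]

# Roll the fitted recurrence forward 10 steps to "generate" future values
forecast = []
last = series[-1]
for _ in range(10):
    last = c + phi * last
    forecast.append(last)
print(np.round(forecast, 2))
```

Real forecasting workflows layer far more on top of this (trend and seasonality terms, differencing, prediction intervals), which is exactly the complexity we sidestep below.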
In this tutorial, we will generate our data using a simple linear model for each feature, modeled individually against Year for each country. While this approach may not always be valid depending on your dataset, we will first evaluate its feasibility by calculating the correlation between each feature and Year at the country level. In the next code chunk, we’ll compute these correlations for each country individually and then calculate the mean correlation across all countries. Unlike the global correlation matrix we calculated earlier, this method provides granular insights by isolating the relationship between each feature and Year on a per-country basis.
In [90]:
# Getting the list of features to check correlations
features = list(df.columns[2:].values)
# Initializing the mean correlation dataframe
country_corr_df = pd.DataFrame(columns=["Country"] + features)
# Looping through each country to calculate correlations
for country in df['Name'].unique():
country_data = df[df['Name'] == country]
mean_corr_list = [country]
for feature in features:
    # Checking for sufficient variability in the country's feature
if country_data[feature].nunique() > 1 and country_data["Year"].nunique() > 1:
# calculating the correlation for the feature vs year
correlation = country_data['Year'].corr(country_data[feature])
# adding the correlation value to the mean corr list
mean_corr_list.append(correlation if pd.notnull(correlation) else np.nan)
    # appending NaN if there wasn't enough variability in the data
    # (only a few very small countries need this)
else:
mean_corr_list.append(np.nan)
# adding the country correlations to the dataframe
country_corr_df.loc[len(country_corr_df)] = mean_corr_list
# calculating the mean correlation of all countries for every feature
mean_correlations = country_corr_df.iloc[:, 1:].mean()
# creating a dataframe of the granular correlations
mean_correlations_df = mean_correlations.reset_index()
mean_correlations_df.columns = ["Feature", "Mean Correlation"]
mean_correlations_df
Out[90]:
| | Feature | Mean Correlation |
|---|---|---|
| 0 | Year | 1.000000 |
| 1 | Population | 0.723630 |
| 2 | Annual Growth Rate | -0.372863 |
| 3 | Rate of Natural Increase | -0.673199 |
| 4 | Population Density | 0.725892 |
| 5 | Total Fertility Rate | -0.537820 |
| 6 | Crude Birth Rate | -0.756287 |
| 7 | Life Expectancy at Birth, Both Sexes | 0.896897 |
| 8 | Infant Mortality Rate, Both Sexes | -0.890217 |
| 9 | Crude Death Rate | -0.048395 |
| 10 | Net Migration Rate | 0.142402 |
| 11 | Births, Both Sexes | -0.005255 |
| 12 | Deaths, Both Sexes | 0.485378 |
The granular correlations in the table above tell a very different story compared to the global correlations between features and Year. While the absolute mean global correlation was just 9.5%, analyzing the correlations at the country level reveals a significantly higher absolute mean correlation of 52.2%. This indicates much stronger relationships when considering countries individually, even with some features lowering the overall mean.
The following code chunk will iteratively generate new data for every country for the years 2025-2050 using simple linear regression. We can expand the years further into the future, but as we do so, we will make continuously more unreliable predictions.
In [91]:
from sklearn.linear_model import LinearRegression
# Setting a seed for reproducibility
np.random.seed(37)
# defining the prediction range (feel free to play around with the years)
future_years = np.arange(2025, 2051)
# Initializing a list to hold predictions
country_predictions = []
# grouping the data by country
grouped_data = df.groupby("Name")
# Looping over each country
for country_name, group in grouped_data:
# Country prediction dataframes
country_future_data = {"Year": future_years}
# Looping over the countries features
features_to_predict = group.drop(columns=["Year", "Name", "GENC"]).columns
for feature in features_to_predict:
# Prepare the data to be fed into a linear model
years = group["Year"].values.reshape(-1, 1)
feature_values = group[feature].values
# Training the linear regression model for the specific country and feature
pred_model = LinearRegression()
pred_model.fit(years, feature_values)
        # Making predictions for the feature over the year range defined above
predictions = pred_model.predict(future_years.reshape(-1, 1))
country_future_data[feature] = predictions
# creating a dataframe of the country predictions
country_future_df = pd.DataFrame(country_future_data)
country_future_df["Name"] = country_name
# adding the country data to country predictions
country_predictions.append(country_future_df)
# Creating a dataframe of all predictions and assigning the country name to the first column
future_predictions_df = pd.concat(country_predictions, ignore_index=True)
columns = ["Name"] + [col for col in future_predictions_df.columns if col != "Name"]
future_predictions_df = future_predictions_df[columns]
# Showing the last 5 rows of the dataframe
future_predictions_df.tail(5) # View the last few rows
Out[91]:
| | Name | Year | Population | Annual Growth Rate | Rate of Natural Increase | Population Density | Total Fertility Rate | Crude Birth Rate | Life Expectancy at Birth, Both Sexes | Infant Mortality Rate, Both Sexes | Crude Death Rate | Net Migration Rate | Births, Both Sexes | Deaths, Both Sexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5715 | Zimbabwe | 2046 | 2.034011e+07 | 3.7450 | 3.234092 | 52.623846 | 3.279246 | 27.642154 | 92.870154 | 11.914462 | -4.717077 | 5.085538 | 583592.961538 | -5629.06 |
| 5716 | Zimbabwe | 2047 | 2.052373e+07 | 3.8141 | 3.263754 | 53.100077 | 3.262777 | 27.474923 | 93.932923 | 10.916769 | -5.181462 | 5.479231 | 587011.639231 | -10261.25 |
| 5717 | Zimbabwe | 2048 | 2.070734e+07 | 3.8832 | 3.293415 | 53.576308 | 3.246308 | 27.307692 | 94.995692 | 9.919077 | -5.645846 | 5.872923 | 590430.316923 | -14893.44 |
| 5718 | Zimbabwe | 2049 | 2.089096e+07 | 3.9523 | 3.323077 | 54.052538 | 3.229838 | 27.140462 | 96.058462 | 8.921385 | -6.110231 | 6.266615 | 593848.994615 | -19525.63 |
| 5719 | Zimbabwe | 2050 | 2.107457e+07 | 4.0214 | 3.352738 | 54.528769 | 3.213369 | 26.973231 | 97.121231 | 7.923692 | -6.574615 | 6.660308 | 597267.672308 | -24157.82 |
That’s it! We now have data for the years 2025–2050, ready to be fed into our models to make future predictions. We’ll follow the same steps as before to generate predictions using our newly created data.
Every prediction must pass through exactly the same transformation steps we used when building the model. If we added a bias term during training, we must add it again here; if we removed the bias, standardized the data, then added the bias back, we must follow that same strict pipeline.
A data pipeline can encompass various topics in data science. In the context of this tutorial, it refers to the specific, sequential transformations and processes applied to the data. While we’re skipping over the automation of this pipeline here, you can explore more about building automated pipelines on Hazelcast, linked below:
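One way to enforce this strict, identical sequence of transformations is to bundle the steps with the estimator. The sketch below is illustrative only (it is not the exact statsmodels workflow used above): it uses scikit-learn's Pipeline on hypothetical data so that fit and predict always apply the same steps in the same order.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical training data standing in for our feature matrix
rng = np.random.default_rng(37)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

pipe = Pipeline([
    ("scale", StandardScaler()),                               # scaler fit once on training data
    ("poly", PolynomialFeatures(degree=2, include_bias=True)), # adds the bias/constant term
    ("model", LinearRegression(fit_intercept=False)),
])
pipe.fit(X_train, y_train)

# New (e.g. generated future) data flows through the exact same steps
X_future = rng.normal(size=(5, 3))
preds = pipe.predict(X_future)
print(preds.shape)  # (5,)
```

Because the scaler and polynomial expansion are fitted inside the pipeline, it is impossible to accidentally skip or reorder a transformation at prediction time, which is exactly the failure mode the manual pipeline below must guard against.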
In [92]:
# Creating our feature matrix and target variable
X = future_predictions_df.drop(["Life Expectancy at Birth, Both Sexes", "Name"], axis=1)
y = future_predictions_df['Life Expectancy at Birth, Both Sexes']
Basic Linear Model Prediction
In [93]:
# Setting the figure size
plt.figure(figsize=(12,7))
# adding the bias
X_with_const = sm.add_constant(X, has_constant="add")
# plotting all predictions
plt.plot(
X["Year"],
basic_model.predict(X_with_const),
label="Individual Country Predictions",
linestyle="solid",
linewidth=0.5,
color="blue",
alpha=0.4)
# Calculating the mean prediction for each year
future_predictions_df["Predicted Life Expectancy"] = basic_model.predict(X_with_const)
mean_predictions = (future_predictions_df.groupby("Year")["Predicted Life Expectancy"].mean())
# Plotting the mean prediction line
plt.plot(
mean_predictions.index,
mean_predictions.values,
label="Mean Prediction",
linestyle="dashed",
linewidth=2.5,
color="red")
# decorating the plot
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.title("Basic Linear Model Predictions", size=20)
plt.legend()
plt.grid(True)
# Show the plot
plt.show()
# Getting the high and low countries
target_year = future_predictions_df['Year'] == 2050
tar_pred = future_predictions_df['Predicted Life Expectancy']
print(f"High Life Expectancies: {list(future_predictions_df['Name'][(tar_pred > 90) & (target_year)].unique())}")
print(f"Low Life Expectancies: {list(future_predictions_df['Name'][(tar_pred < 70) & (target_year)].unique())}")
High Life Expectancies: ['Angola', 'Azerbaijan', 'Macao', 'Northern Mariana Islands', 'Qatar', 'Rwanda', 'Singapore', 'Uganda', 'Zambia']
Low Life Expectancies: ['Central African Republic', 'Somalia', 'Yemen']
Scaled Polynomial Model Prediction
In [94]:
# Setting the figure size
plt.figure(figsize=(12,7))
# Pipeline of transformations
X_future = X
X_future_scaled = scaler_poly.transform(X_future)
X_future_scaled = pd.DataFrame(X_future_scaled, columns=X_future.columns, index=X_future.index)
X_future_scaled = sm.add_constant(X_future_scaled)
X_test_interaction = poly_scale.transform(X_future_scaled)
interaction_feature_names = poly_scale.get_feature_names_out(X_future_scaled.columns)
X_test_interaction_df = pd.DataFrame(X_test_interaction, columns=interaction_feature_names, index=X_future_scaled.index)
# Individual country predictions
plt.plot(
X["Year"],
scaled_poly_model.predict(X_test_interaction_df),
label="Individual Countries",
linestyle="solid",
linewidth=0.5,
color="blue",
alpha=0.4)
# Calculating the mean prediction for each year
future_predictions_df["Predicted Life Expectancy"] = scaled_poly_model.predict(X_test_interaction_df)
mean_predictions = (future_predictions_df.groupby("Year")["Predicted Life Expectancy"].mean())
# Plotting the mean prediction
plt.plot(
mean_predictions.index,
mean_predictions.values,
label="Mean Prediction",
linestyle="dashed",
linewidth=2.5,
color="red")
# Decorating the plot
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.title("Polynomial Scaled Model Prediction", size=20)
plt.legend()
plt.grid(True)
# displaying the plot
plt.show()
# Getting the high and low countries
target_year = future_predictions_df['Year'] == 2050
tar_pred = future_predictions_df['Predicted Life Expectancy']
print(f"High Life Expectancies: {list(future_predictions_df['Name'][(tar_pred > 100) & (target_year)].unique())}")
print(f"Low Life Expectancies: {list(future_predictions_df['Name'][(tar_pred < 70) & (target_year)].unique())}")
High Life Expectancies: ['Azerbaijan', 'Malawi', 'Monaco', 'Northern Mariana Islands', 'Rwanda', 'Saint Pierre and Miquelon', 'Zambia']
Low Life Expectancies: ['Georgia', 'Somalia']
Ridge Model Prediction
In [95]:
# Setting the plot size
plt.figure(figsize=(12,7))
# Transformation pipeline
X_future = X
X_future_scaled = scaler_rl.transform(X_future)
X_future_scaled = sm.add_constant(X_future_scaled)
# Individual country predictions
plt.plot(
X["Year"],
ridge.predict(X_future_scaled),
label="Individual Countries",
linestyle="solid",
linewidth=0.5,
color="blue",
alpha=0.4)
# Calculating the mean predictions for each year
future_predictions_df["Predicted Life Expectancy"] = ridge.predict(X_future_scaled)
mean_predictions = (future_predictions_df.groupby("Year")["Predicted Life Expectancy"].mean())
# Plotting the mean predictions
plt.plot(
mean_predictions.index,
mean_predictions.values,
label="Mean Prediction",
linestyle="dashed",
linewidth=2.5,
color="red")
# Decorating the plot
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.title("Ridge Predictions", size=20)
plt.legend()
plt.grid(True)
# displaying the plot
plt.show()
# Getting the high and low countries
target_year = future_predictions_df['Year'] == 2050
tar_pred = future_predictions_df['Predicted Life Expectancy']
print(f"High Life Expectancies: {list(future_predictions_df['Name'][(tar_pred > 90) & (target_year)].unique())}")
print(f"Low Life Expectancies: {list(future_predictions_df['Name'][(tar_pred < 70) & (target_year)].unique())}")
High Life Expectancies: ['Angola', 'Azerbaijan', 'Macao', 'Northern Mariana Islands', 'Qatar', 'Rwanda', 'Singapore', 'Uganda', 'Zambia']
Low Life Expectancies: ['Central African Republic', 'Somalia', 'Yemen']
Lasso Model Prediction
In [96]:
plt.figure(figsize=(12,7))
# Transformation pipeline
X_future = X
X_future_scaled = scaler_rl.transform(X_future)
X_future_scaled = sm.add_constant(X_future_scaled)
# Individual country predictions
plt.plot(
X["Year"],
lasso.predict(X_future_scaled),
label="Individual Countries",
linestyle="solid",
linewidth=0.5,
color="blue",
alpha=0.4)
# Calculating the mean predictions for each year
future_predictions_df["Predicted Life Expectancy"] = lasso.predict(X_future_scaled)
mean_predictions = (future_predictions_df.groupby("Year")["Predicted Life Expectancy"].mean())
# Plotting the mean predictions
plt.plot(
mean_predictions.index,
mean_predictions.values,
label="Mean Prediction",
linestyle="dashed",
linewidth=2.5,
color="red")
# Decorating the plot
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.title("Lasso Predictions", size = 20)
plt.legend()
plt.grid(True)
# displaying the plot
plt.show()
# Getting the high and low countries
target_year = future_predictions_df['Year'] == 2050
tar_pred = future_predictions_df['Predicted Life Expectancy']
print(f"High Life Expectancies: {list(future_predictions_df['Name'][(tar_pred > 90) & (target_year)].unique())}")
print(f"Low Life Expectancies: {list(future_predictions_df['Name'][(tar_pred < 70) & (target_year)].unique())}")
High Life Expectancies: ['Angola', 'Azerbaijan', 'Macao', 'Malawi', 'Nepal', 'Rwanda', 'Uganda', 'Zambia']
Low Life Expectancies: ['Somalia']
Random Forest Model Prediction
In [97]:
plt.figure(figsize=(12,7))
# Individual country predictions
plt.plot(
X["Year"],
best_rf.predict(X),
label="Individual Countries",
linestyle="solid",
linewidth=0.5,
color="blue",
alpha=0.4)
# Calculating the mean predictions for each year
future_predictions_df["Predicted Life Expectancy"] = best_rf.predict(X)
mean_predictions = (future_predictions_df.groupby("Year")["Predicted Life Expectancy"].mean())
# Plotting the mean predictions
plt.plot(
mean_predictions.index,
mean_predictions.values,
label="Mean Prediction",
linestyle="dashed",
linewidth=2.5,
color="red")
# Decorating the plot
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.title("Random Forest Predictions", size=20)
plt.legend()
plt.grid(True)
# displaying the plot
plt.show()
# Getting the high and low countries
target_year = future_predictions_df['Year'] == 2050
tar_pred = future_predictions_df['Predicted Life Expectancy']
print(f"High Life Expectancies: {list(future_predictions_df['Name'][(tar_pred > 85) & (target_year)].unique())}")
print(f"Low Life Expectancies: {list(future_predictions_df['Name'][(tar_pred < 70) & (target_year)].unique())}")
High Life Expectancies: ['Macao', 'Singapore']
Low Life Expectancies: ['Somalia']
Long Short-Term Memory (LSTM) Model Predictions
In [98]:
# Setting a seed for reproducibility
np.random.seed(37)
# Grabbing the column names
feature_columns = list(X_nn.columns)
# Getting the future prediction dataframe we created earlier
X_new = future_predictions_df[feature_columns].values
# scaling the features using the fitted nn scaler
X_new_scaled = scaler_X_nn.transform(pd.DataFrame(X_new, columns=feature_columns))
# defining hyperparameters timesteps and features
timesteps = 1
n_features = X_new_scaled.shape[1]
# correcting the shape of the input
X_new_scaled = X_new_scaled.reshape((X_new_scaled.shape[0], timesteps, n_features))
# creating the predictions
y_new_pred = nn_model.predict(X_new_scaled).flatten()
# adding the predictions to the dataframe
future_predictions_df["Predicted Life Expectancy"] = y_new_pred
# Calculating the mean prediction per year
mean_predictions = (future_predictions_df.groupby("Year")["Predicted Life Expectancy"].mean())
# plotting all predictions
plt.figure(figsize=(12, 7))
for country, group in future_predictions_df.groupby("Name"):
plt.plot(
group["Year"],
group["Predicted Life Expectancy"],
color="blue",
alpha=0.2,
linewidth=0.5)
# plotting the mean prediction
plt.plot(
mean_predictions.index,
mean_predictions.values,
label="Mean Prediction",
linestyle="dashed",
linewidth=2.5,
color="red")
# decorating the plot
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.title("LSTM Predictions", size=20)
plt.legend(["Individual Countries", "Mean Prediction"])
plt.grid(True)
# displaying the plot
plt.show()
# Getting the high and low countries
target_year = future_predictions_df['Year'] == 2050
tar_pred = future_predictions_df['Predicted Life Expectancy']
print(f"High Life Expectancies: {list(future_predictions_df['Name'][(tar_pred > 120) & (target_year)].unique())}")
print(f"Low Life Expectancies: {list(future_predictions_df['Name'][(tar_pred < 70) & (target_year)].unique())}")
179/179 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
High Life Expectancies: ['Angola', 'Azerbaijan', 'Bhutan', 'Montserrat', 'Rwanda', 'Uganda', 'Zambia']
Low Life Expectancies: ['Algeria', 'Djibouti', 'Germany', 'Kiribati', 'Macao', 'Malaysia', 'Monaco', 'Panama', 'Papua New Guinea', 'Suriname', 'Timor Leste', 'Turkmenistan', 'Tuvalu', 'Uzbekistan', 'Venezuela']
The Report Summary phase of the data science pipeline is a critical stage where you distill your findings into a concise, clear report that can be effectively communicated to stakeholders. Even the best analysis in the world will fail to make an impact if its insights are not communicated properly. In this phase you pull key insights from your analysis to tell the story of your data. If the story is overwhelming, you risk losing the stakeholders' attention; if it is underwhelming, your results may leave more questions than answers. The goal is to tell a complete story as succinctly as possible while being prepared to address follow-up questions. One such question could concern the specific data cleaning steps in your pipeline, so it is worth including those details at the end of your presentation or report in case stakeholders ask for them. If there are no questions from the stakeholders, that's great, but it's always best to be prepared. For more information on reporting, check out this article from Optimal Workshop on creating reports and presenting data to stakeholders, linked below.
Since this is a tutorial designed for the educational purpose of teaching the data science pipeline, I won’t create a full stakeholder report. However, I’ve provided a brief textual summary of our findings below.
Summary of Findings
In this project, we analyzed data from 220 countries worldwide, spanning the years 2000 to 2024, to identify underlying trends that affect human life expectancy. The data was collected from the United States Census Bureau's International Database, linked here: US Census Bureau IDB. We found that several of our dataset's features were highly correlated with human life expectancy, with the strongest relationship being a negative correlation with infant mortality rate.
Crude birth rate and total fertility rate also had strong negative correlations with human life expectancy. When analyzing crude birth rate against total fertility rate, we found a very strong positive correlation of 98%, as well as a clear pattern of countries in Africa having both the highest birth rates and the highest expected number of children per woman. Furthermore, analysis showed that India and China together comprise about 35% of the world's population, yet neither ranks in the top 40 countries for births per 1,000 people. One likely explanation for Africa's elevated birth rates is the continent's high crude death rate and infant mortality rate, as revealed in our analysis. Countries with high infant mortality rates had strikingly lower life expectancies, with the African continent displaying this correlation most strongly.
The mean life expectancy across countries was 68.39 years in 2000 and rose to 75.1 by 2024. Among the machine learning models we evaluated, the Random Forest performed best, achieving a test R² of 0.99075, or about 99% of the variability in human life expectancy explained. The Random Forest determined that the most important feature, and the appropriate root node, was infant mortality rate, with subsequent splits on either infant mortality rate or crude death rate, the second most important feature.
Finally, we used six different machine learning models to predict human life expectancy through the year 2050, obtaining mean 2050 life expectancies ranging from 80 to about 95 years, with five of the six models falling between 80 and 88 years. The best-scoring model, Random Forest, predicted a mean human life expectancy of around 81 years in 2050.
Country-specific life expectancies were also calculated, but due to the rapid gains in life expectancy among developing nations during this period, the results were somewhat skewed in favor of those nations in the future. The Random Forest model was the only model that appropriately kept these developing nations from dominating the top spots for life expectancy. It identified Macao and Singapore as the two countries with the highest life expectancies in 2050, and Somalia as the country predicted to have the lowest.